arXiv submission date: 2026-06-21
📄 Abstract - Large Language Model-Assisted Cleaning of Report-Derived Labels in a Large-Scale Chest CT Dataset

Purpose: To evaluate whether large language model (LLM)-assisted label cleaning can identify label-report discordance in CT-RATE, a large-scale public chest CT dataset.

Materials and Methods: After report-level deduplication, 24,446 unique radiology reports were identified. Twelve reports were excluded from the primary GPT-5.4 analysis because of Microsoft Azure AI Foundry content-safety filtering, leaving 24,434 reports and 439,812 label instances across 18 abnormality categories. GPT-5.4-derived binary labels were generated from report text using structured JSON output and compared with existing CT-RATE labels. Discordant instances were adjudicated by radiologists. In addition, 100 randomly sampled reports were manually annotated to compare CT-RATE labels, individual LLM-derived labels, and multi-LLM majority-vote labels against radiologist-annotated reference labels.

Results: Overall agreement between GPT-5.4-derived and CT-RATE labels was 96.4%, with Cohen's kappa of 0.884. Lymphadenopathy showed the lowest agreement and kappa. In discordance review, radiologist adjudication supported GPT-5.4-derived labels in 72 of 97 (74.2%) general discordant instances and 91 of 99 (91.9%) targeted lymphadenopathy discordant instances. Against radiologist-annotated reference labels, multi-LLM majority-vote labels achieved the highest label-macro-averaged F1 score and Cohen's kappa.

Conclusion: LLM-assisted label cleaning identified clinically meaningful label-report discordance in CT-RATE and may support scalable quality improvement of public imaging datasets. The cleaned dataset will be made publicly available to support future research.
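The abstract reports two agreement statistics between GPT-5.4-derived and CT-RATE labels: overall agreement and Cohen's kappa. The sketch below shows how both can be computed for a pair of binary label vectors. It is an illustration only, not the authors' code; the function names and the toy label vectors are assumptions made for this example.

```python
from typing import Sequence


def percent_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions where two binary label vectors match."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label vectors must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two binary (0/1) raters: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)          # observed agreement
    p_a1 = sum(a) / n                      # rate of positives for rater A
    p_b1 = sum(b) / n                      # rate of positives for rater B
    # Agreement expected by chance if the raters were independent.
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if p_e == 1.0:
        # Both raters used one identical class throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: 1 = abnormality present, 0 = absent.
existing_labels = [1, 1, 0, 0, 1, 0, 0, 0]
llm_labels = [1, 0, 0, 0, 1, 0, 1, 0]

agreement = percent_agreement(existing_labels, llm_labels)
kappa = cohens_kappa(existing_labels, llm_labels)
print(f"agreement={agreement:.3f}, kappa={kappa:.3f}")
```

Kappa corrects raw agreement for chance, which matters for rare findings: when most reports are negative for a category, two labelers can agree often without agreeing on the positives. That is one plausible reason a category can show high raw agreement but low kappa.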

Top-level tags: medical llm data
Detailed tags: label cleaning chest ct radiology dataset quality

Large Language Model-Assisted Cleaning of Report-Derived Labels in a Large-Scale Chest CT Dataset


1️⃣ One-sentence summary

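This study uses large language models such as GPT-5.4 to detect discordance between existing labels and radiology reports in CT-RATE, a large-scale public chest CT dataset; radiologist adjudication supported the GPT-5.4-derived labels in most reviewed discordant instances (72 of 97 general and 91 of 99 targeted lymphadenopathy instances), multi-LLM majority-vote labels achieved the highest label-macro-averaged F1 score and Cohen's kappa against radiologist-annotated reference labels, and the cleaned dataset will be made publicly available to support future research.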
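Multi-LLM majority voting and label-macro-averaged F1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: it assumes an odd number of models, binary labels, and that "label-macro-averaged" means the unweighted mean of per-category F1 scores. All function names and toy data are hypothetical.

```python
from typing import List, Mapping, Sequence


def majority_vote(votes_per_model: Sequence[Sequence[int]]) -> List[int]:
    """Combine binary labels from several models; each inner sequence is one model."""
    n_models = len(votes_per_model)
    if n_models == 0 or n_models % 2 == 0:
        raise ValueError("use an odd, non-zero number of models to avoid ties")
    # zip(*...) regroups the votes so each tuple holds all models' votes for one item.
    return [int(2 * sum(votes) > n_models) for votes in zip(*votes_per_model)]


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    # Assumption: score 0.0 when there are no positives in truth or prediction.
    return 2 * tp / denom if denom else 0.0


def macro_f1(
    y_true_by_label: Mapping[str, Sequence[int]],
    y_pred_by_label: Mapping[str, Sequence[int]],
) -> float:
    """Unweighted mean of per-category F1 scores."""
    scores = [
        f1_binary(y_true_by_label[name], y_pred_by_label[name])
        for name in y_true_by_label
    ]
    return sum(scores) / len(scores)


# Hypothetical votes from three models on four reports for one category.
votes = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0]]
print(majority_vote(votes))
```

Averaging F1 per category before combining gives rare abnormalities the same weight as common ones, which is a common reason to prefer macro averaging when label prevalence varies widely.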

Source: arXiv: 2606.22382