arXiv submission date: 2026-03-02
📄 Abstract - Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation

Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.
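The paper does not give the exact formulas for CAD and WAE, but the idea of measuring how demographic-group word associations shift between reference and generated reports can be sketched as a toy calculation. Here is a minimal, hypothetical Python sketch: the function names, the mention-frequency definition of "association", and the weighting scheme are all illustrative assumptions, not the paper's actual method.

```python
def term_association(reports, term):
    """Fraction of a group's reports that mention a clinical term.
    (Assumed stand-in for the paper's association measure.)"""
    return sum(term in r.lower() for r in reports) / len(reports)

def association_displacement(refs, gens, term):
    """Toy analogue of Clinical Association Displacement (CAD):
    shift in a term's group-level association between reference
    and generated reports. Negative values indicate erasure."""
    return term_association(gens, term) - term_association(refs, term)

def weighted_erasure(groups, terms, weights=None):
    """Toy analogue of Weighted Association Erasure (WAE):
    weighted sum of negative (erased) shifts across groups."""
    weights = weights or {t: 1.0 for t in terms}
    total = 0.0
    for refs, gens in groups.values():
        for t in terms:
            d = association_displacement(refs, gens, t)
            if d < 0:  # count only clinical signal loss, not gains
                total += weights[t] * -d
    return total

# Hypothetical data: the generated reports drop "effusion" for group B
groups = {
    "A": (["mild effusion noted", "no effusion"],
          ["mild effusion noted", "no effusion"]),
    "B": (["large effusion present", "effusion seen"],
          ["no acute findings", "stable exam"]),
}
print(weighted_erasure(groups, ["effusion"]))  # → 1.0 (group B: 1.0 -> 0.0)
```

Under these toy definitions, a model that collapses to safe generic text for one demographic group shows up as a nonzero erasure score even if token-overlap metrics remain high for both groups.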

Top-level tags: medical, model evaluation, multi-modal
Detailed tags: vision-language models, clinical terminology, evaluation metrics, radiology reports, demographic fairness

Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation


1️⃣ One-sentence summary

This paper identifies a blind spot in current methods for evaluating radiology report generation models: a model can earn high scores by producing repetitive, safe, generic text while omitting key clinical terminology. To address this, the authors propose a new lexical-diversity measurement framework that quantifies this loss of clinical information and the associated bias risks.

Source: arXiv:2603.01625