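Good Rankings, Wrong Probabilities: A Calibration Audit of Multimodal Cancer Survival Models
1️⃣ One-sentence summary
This paper presents the first systematic audit of cancer survival prediction models that combine pathology images with genomic data. It finds that although these models rank patients by risk well, the survival probabilities they predict are often inaccurate, which challenges their reliability in clinical use.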
Multimodal deep learning models that fuse whole-slide histopathology images with genomic data have achieved strong discriminative performance for cancer survival prediction, as measured by the concordance index. Yet whether the survival probabilities derived from these models - either directly from native outputs or via standard post-hoc reconstruction - are calibrated remains largely unexamined. We conduct, to our knowledge, the first systematic fold-level 1-calibration audit of multimodal WSI-genomics survival architectures, evaluating native discrete-time survival outputs (Experiment A: 3 models on TCGA-BRCA) and Breslow-reconstructed survival curves from scalar risk scores (Experiment B: 11 architectures across 5 TCGA cancer types). In Experiment A, all three models fail 1-calibration on a majority of folds (12 of 15 fold-level tests reject after Benjamini-Hochberg correction). Across the full 290 fold-level tests, 166 reject the null of correct calibration at the median event time after Benjamini-Hochberg correction (FDR = 0.05). MCAT achieves C-index 0.817 on GBMLGG yet fails 1-calibration on all five folds. Gating-based fusion is associated with better calibration; bilinear and concatenation fusion are not. Post-hoc Platt scaling reduces miscalibration at the evaluated horizon (e.g., MCAT: 5/5 folds failing to 2/5) without affecting discrimination. The concordance index alone is insufficient for evaluating survival models intended for clinical use.
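Two of the steps named in the abstract, Breslow reconstruction of survival probabilities from scalar risk scores and Benjamini-Hochberg correction of the fold-level p-values, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names `breslow_survival_at` and `benjamini_hochberg` are hypothetical, risk scores are assumed to be log hazard ratios, and the baseline hazard is assumed to be fit on the training split and applied to the test split.

```python
import math


def breslow_survival_at(train_times, train_events, train_risk, test_risk, horizon):
    """Survival probability S(horizon | risk) for each test patient.

    Assumes risk scores are log hazard ratios (Cox-style). The Breslow
    baseline cumulative hazard is accumulated over training event times
    up to and including `horizon`.
    """
    hazard_ratio = [math.exp(r) for r in train_risk]
    # Distinct training event times (censored times are excluded).
    event_times = sorted(
        {t for t, e in zip(train_times, train_events) if e and t <= horizon}
    )
    h0 = 0.0
    for t in event_times:
        # Number of events observed exactly at time t.
        deaths = sum(1 for tj, ej in zip(train_times, train_events) if ej and tj == t)
        # Sum of hazard ratios over patients still at risk at time t.
        at_risk = sum(h for tj, h in zip(train_times, hazard_ratio) if tj >= t)
        h0 += deaths / at_risk
    return [math.exp(-h0 * math.exp(r)) for r in test_risk]


def benjamini_hochberg(pvals, fdr=0.05):
    """Return a list of booleans: True where the null is rejected at level `fdr`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest rank k with p_(k) <= fdr * k / m; reject the k smallest p-values.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= fdr * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Toy usage with made-up numbers (not data from the paper).
surv = breslow_survival_at(
    train_times=[1, 2, 3, 4],
    train_events=[1, 0, 1, 1],
    train_risk=[0.0, 0.0, 0.0, 0.0],
    test_risk=[0.0, math.log(2)],
    horizon=3,
)
rejected = benjamini_hochberg([0.001, 0.02, 0.03, 0.5], fdr=0.05)
```

The reconstructed probabilities would then be compared with observed outcomes at the chosen horizon (the abstract uses the median event time) to obtain one p-value per fold, and those p-values are what the Benjamini-Hochberg step corrects. The calibration test itself is omitted here because the abstract does not specify its exact form.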
Source: arXiv: 2604.04239