SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
1️⃣ One-sentence summary
This paper proposes SIEVES, a method that has a multimodal large language model produce localized visual evidence regions while answering and trains an assessment module to quantify the quality of that evidence, so the system can abstain when uncertain. This markedly improves the model's reliability and coverage on complex or out-of-distribution data.
Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Yet even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) scenarios. Specifically, selective prediction aims to improve coverage, i.e., the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a threshold. To enable reliable generalization, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all five tested OOD datasets and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation.
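The abstention mechanism described above — answer only when confidence clears a threshold chosen to keep empirical risk under a user-defined level — can be sketched as follows. This is a minimal generic illustration, not the SIEVES selector itself; the function name and the empirical-risk sweep over sorted confidences are my own assumptions.

```python
import numpy as np

def calibrate_selective_threshold(confidences, correct, risk_level):
    """Sweep confidence thresholds (most confident answers first) and
    return the threshold that maximizes coverage while keeping the
    empirical error rate of answered inputs at or below risk_level.

    confidences: per-answer confidence scores (higher = more confident)
    correct:     boolean array, whether each answer was correct
    Returns (threshold, coverage); threshold is None if the model
    must abstain on everything to satisfy the risk level.
    """
    order = np.argsort(-confidences)          # most confident first
    conf_sorted = confidences[order]
    correct_sorted = correct[order]
    errors = np.cumsum(~correct_sorted)       # errors among top-k answers
    n_answered = np.arange(1, len(confidences) + 1)
    risk = errors / n_answered                # empirical risk at each cutoff
    feasible = np.where(risk <= risk_level)[0]
    if len(feasible) == 0:
        return None, 0.0                      # abstain on all inputs
    k = feasible[-1]                          # largest answered set within risk
    return conf_sorted[k], (k + 1) / len(confidences)
```

At deployment, the system answers only inputs whose confidence meets the calibrated threshold and abstains on the rest; coverage is the fraction answered.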
Source: arXiv: 2604.25855