📄 Abstract - CaptionQA: Is Your Caption as Useful as the Image Itself?

Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well the caption supports downstream tasks. CaptionQA is an extensible, domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with fine-grained taxonomies (25 top-level categories and 69 subcategories) that identify the information useful for domain-specific tasks. CaptionQA comprises 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and remain usable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between image utility and caption utility. Notably, models nearly identical on traditional image-QA benchmarks differ by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at this https URL.
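To make the protocol concrete, here is a minimal Python sketch of the caption-only evaluation described above: an LLM answers each multiple-choice question from the caption alone, and utility is the resulting accuracy. The `qa_llm` callable and the data schema are hypothetical stand-ins, not the authors' released code.

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    """One densely annotated multiple-choice question for an image."""
    question: str
    options: list[str]  # e.g. ["A. red", "B. blue", "C. green", "D. white"]
    answer: str         # gold option letter, e.g. "B"

def caption_utility(captions: dict[str, str],
                    questions: dict[str, list[MCQ]],
                    qa_llm) -> float:
    """Accuracy of an LLM answering visually grounded MCQs from the
    caption alone (no image), i.e. the paper's notion of caption utility."""
    correct, total = 0, 0
    for image_id, caption in captions.items():
        for q in questions[image_id]:
            prompt = (
                f"Caption: {caption}\n"
                f"Question: {q.question}\n"
                f"Options: {' '.join(q.options)}\n"
                "Answer with the option letter only."
            )
            pred = qa_llm(prompt).strip()[:1].upper()  # hypothetical LLM call
            correct += int(pred == q.answer)
            total += 1
    return correct / total
```

Comparing this score against the same LLM's accuracy when given the image directly yields the image-vs-caption utility gap the paper reports.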

Top-level tags: model evaluation, multi-modal, natural language processing
Detailed tags: image captioning, benchmark, utility evaluation, multimodal llm, qa-based assessment

CaptionQA: Is Your Caption as Useful as the Image Itself?


1️⃣ One-sentence summary

This paper introduces CaptionQA, a new benchmark that evaluates caption quality by quantifying how well a caption can substitute for the original image in supporting downstream tasks (such as retrieval, recommendation, and embodied AI), revealing a substantial utility gap between captions generated by current state-of-the-art models and the original images.


2️⃣ Key innovations

1. Proposes a utility-centered evaluation paradigm

2. Builds a domain-specific, fine-grained, taxonomy-driven benchmark (see the schema sketch after this list)

3. Designs a lightweight evaluation protocol based on deterministic question answering

4. Systematically evaluates and optimizes prompting strategies
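As a rough illustration of the taxonomy-driven structure in item 2, the sketch below shows how one benchmark entry might be organized; the field names are assumptions for illustration, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class CaptionQAItem:
    """Hypothetical shape of one CaptionQA entry (illustrative field names)."""
    image_id: str
    domain: str       # one of: Natural, Document, E-commerce, Embodied AI
    category: str     # one of the 25 top-level taxonomy categories
    subcategory: str  # one of the 69 subcategories
    question: str     # must require visual information to answer
    options: list[str]
    answer: str       # gold option letter

# Each image carries about 50.3 such questions on average (33,027 in total).
```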


3️⃣ Main results and value

Result highlights

- State-of-the-art MLLMs show substantial gaps between image utility and caption utility; models nearly identical on traditional image-QA benchmarks differ by up to 32% in caption utility.

Practical value

- CaptionQA, together with its open-source construction pipeline, gives practitioners a direct way to test whether captions can stand in for images in retrieval, recommendation, and agentic pipelines, and can be extended to new domains.


4️⃣ Glossary

- Caption utility: how well a caption can substitute for its image in supporting downstream tasks.
- MLLM: multimodal large language model; here, the models whose generated captions are evaluated.
- Taxonomy: the domain-specific hierarchy (25 top-level categories, 69 subcategories) that identifies which visual information matters for each domain.

📄 Open the original PDF