arXiv submission date: 2026-06-24
📄 Abstract - C3-Bench: A Context-Aware Change Captioning Benchmark

While change captioning systems have attracted substantial attention as a response to our evolving world, their true performance across diverse real-world change contexts remains largely unexplored due to the lack of comprehensive evaluation frameworks. To fill this gap, we propose C3-Bench, a comprehensive benchmark for evaluating Context-aware Change Captioning. C3-Bench features: (1) 4,996 human-labeled image pairs spanning 51 real-world change contexts across four domains (natural scenes, remote sensing imagery, image editing, and anomalies), each with diverse, carefully curated scenarios derived from multiple change-centric communities; and (2) the first LLM-as-Judge evaluation framework for the change captioning task, which measures fine-grained dimensions (e.g., correctness, specificity, fluency, and relevance), along with a novel reversibility metric that explores whether models understand changes with symmetric consistency. Based on C3-Bench, we benchmark 32 models, including conventional change captioning models, proprietary Large Multimodal Models (LMMs), and 2B-90B open-source LMMs. We reveal a fundamental blind spot in the prevailing change captioning paradigm: once the change context departs from training-style regimes, conventional models collapse, and even state-of-the-art LMMs such as GPT-5.2 exhibit systematic domain- and position-dependent errors that distort reliable change understanding. By making these hidden failure modes explicit and measurable, we delineate the next frontier for building generalizable and trustworthy change captioning systems. All code and datasets are publicly available on the project page.

Top-level tags: computer vision benchmark multi-modal
Detailed tags: change captioning evaluation llm-as-judge reversibility metric

C3-Bench: A Context-Aware Change Captioning Benchmark


1️⃣ One-Sentence Summary

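To address the insufficient evaluation of existing change captioning systems in real-world settings, the paper proposes C3-Bench, a benchmark of nearly 5,000 human-labeled image pairs covering 51 real-world change contexts across four domains (natural scenes, remote sensing, image editing, and anomaly detection), and is the first to use a large language model as a judge to assess captions in detail for correctness, specificity, fluency, and relevance; the results show that current mainstream models, including top multimodal models such as GPT-5.2, make systematic errors on new scenarios whose style differs from their training data, which points to key research directions for building general and reliable change captioning systems.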

Source: arXiv: 2606.25445