The Abstraction Gap in Vision-Language Causal Reasoning

📄 Abstract - The Abstraction Gap in Vision-Language Causal Reasoning

Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic quality. The Chain-Text Probe requires models to first generate explicit causal chains. The Abstraction Gap (AG) metric quantifies the normalized performance difference between the two probes. Evaluating eight VLMs on CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, we find that seven models exhibit AG exceeding 0.50, with text scores of 6 to 8 but chain scores below 2.5. Fine-tuning on 45,000 chain-annotated examples fails to close the gap. However, one model achieves near-zero AG, which shows that the capability exists within current VLM architectures and depends on pretraining and architectural choices. CAGE provides a diagnostic tool for assessing faithful causal reasoning in VLMs.
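The abstract does not spell out how AG is normalized, so the following is only a minimal sketch under one plausible assumption: that AG is the drop from the Text-Only score to the Chain-Text score, divided by the Text-Only score. The function name `abstraction_gap` and the sample scores are hypothetical and chosen only to fall within the ranges quoted above; the paper's actual definition may differ.

```python
def abstraction_gap(text_score: float, chain_score: float) -> float:
    """Hypothetical AG: relative drop from the Text-Only Probe score
    to the Chain-Text Probe score (assumed normalization, not confirmed)."""
    if text_score <= 0:
        raise ValueError("text_score must be positive")
    return (text_score - chain_score) / text_score


# Illustrative scores only, picked from the ranges quoted in the abstract.
print(round(abstraction_gap(7.0, 2.0), 3))  # 0.714
print(abstraction_gap(7.0, 7.0))            # 0.0 (no gap)
```

Under this assumed normalization, a text score of 6 paired with a chain score of 2.5 already gives an AG above 0.50, which is consistent with the figures reported for seven of the eight models.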

1️⃣ One-Sentence Summary

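This paper finds that current vision-language models can generate fluent causal explanations yet perform poorly when they must actually reason causally; it proposes an "Abstraction Gap" metric to quantify the difference between linguistic fluency and causal reasoning ability, and its experiments show that most models exhibit a significant gap that is hard to close through fine-tuning, although certain model architecture designs may narrow it.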
