arXiv submission date: 2026-02-17
📄 Abstract - ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution

Multimodal Large Language Models (mLLMs) are often used to answer questions over structured data such as tables in Markdown, JSON, and image form. While these models can often give correct answers, users also need to know where those answers come from. In this work, we study structured data attribution/citation: the ability of models to point to the specific rows and columns that support an answer. We evaluate several mLLMs across different table formats and prompting strategies. Our results show a clear gap between question answering and evidence attribution. Although question answering accuracy remains moderate, attribution accuracy is much lower across all models, dropping to near random for JSON inputs. We also find that models are more reliable at citing rows than columns, and struggle more with textual formats than with images. Finally, we observe notable differences across model families. Overall, our findings show that current mLLMs are unreliable at providing fine-grained, trustworthy attribution for structured data, which limits their use in applications requiring transparency and traceability.
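
To make the attribution task concrete, here is a minimal scoring sketch in Python. It illustrates the general idea of measuring row and column citation accuracy against gold evidence; it is not the paper's evaluation code, and the function name `attribution_accuracy`, the set-valued `rows`/`cols` fields, and the exact-match criterion are all assumptions.

```python
# Hypothetical scoring sketch (not the authors' code): compare the rows and columns a
# model cites against gold evidence annotations, scoring rows and columns separately
# with an exact set-match criterion. All names and fields here are illustrative.

def attribution_accuracy(predictions, gold):
    """Each item is a dict like {"rows": {1, 3}, "cols": {"Revenue"}} (sets of cited indices/names)."""
    row_hits = col_hits = 0
    for pred, ref in zip(predictions, gold):
        row_hits += pred["rows"] == ref["rows"]  # exact match on the cited row set
        col_hits += pred["cols"] == ref["cols"]  # exact match on the cited column set
    n = len(gold)
    return {"row_accuracy": row_hits / n, "col_accuracy": col_hits / n}

# Toy example: the first question's rows are cited correctly, but neither question's columns are.
preds = [{"rows": {2}, "cols": {"Price"}}, {"rows": {0, 1}, "cols": {"Year"}}]
refs = [{"rows": {2}, "cols": {"Total"}}, {"rows": {0, 3}, "cols": {"Year", "Total"}}]
print(attribution_accuracy(preds, refs))  # {'row_accuracy': 0.5, 'col_accuracy': 0.0}
```

Under this kind of scoring, the paper's finding that row citation is more reliable than column citation would show up directly as a gap between the two accuracy values.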

Top-level tags: llm multi-modal model evaluation
Detailed tags: multimodal llms table attribution visual reasoning evidence citation structured data

ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution


1️⃣ One-sentence summary

This paper finds that while current multimodal large language models can often give correct answers to table-based questions, they struggle to accurately identify which rows and columns of the table an answer actually comes from, which makes them unreliable for applications requiring transparency and traceability.

Source: arXiv 2602.15769