arXiv submission date: 2026-04-12
📄 Abstract - DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain

Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained "hard" negatives from identical menus and rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, identifying five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All code is released at this https URL.
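To make the abstract's data design concrete, here is a minimal, hypothetical sketch of what a DiningBench-style entry and its "hard" negatives might look like. The class name `DishEntry`, the field names, and the sample values are all assumptions for illustration; the paper's actual schema is not specified here.

```python
from dataclasses import dataclass

# Hypothetical entry schema: multi-view images (avg. 5.27 per dish in the
# benchmark) plus verified nutrition metadata. Field names are illustrative.
@dataclass
class DishEntry:
    dish_id: str
    name: str
    image_paths: list  # multiple views of the same dish
    menu_id: str       # dishes from the same menu act as hard negatives
    nutrition: dict    # verification-based nutritional values, e.g. {"kcal": 480}

def hard_negatives(entry, corpus):
    """Fine-grained 'hard' negatives: other dishes from the same menu."""
    return [d for d in corpus
            if d.menu_id == entry.menu_id and d.dish_id != entry.dish_id]

corpus = [
    DishEntry("d1", "Mapo Tofu", ["d1_a.jpg", "d1_b.jpg"], "menu_7", {"kcal": 480}),
    DishEntry("d2", "Kung Pao Chicken", ["d2_a.jpg"], "menu_7", {"kcal": 610}),
    DishEntry("d3", "Caesar Salad", ["d3_a.jpg"], "menu_9", {"kcal": 320}),
]
print([d.name for d in hard_negatives(corpus[0], corpus)])  # → ['Kung Pao Chicken']
```

Sampling negatives from the same menu, rather than random dishes, is what makes the classification task fine-grained: the distractors are visually and semantically close to the target.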

Top-level tags: multi-modal benchmark, model evaluation
Detailed tags: vision-language models, food domain, fine-grained classification, nutrition estimation, visual question answering

DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain


1️⃣ One-sentence summary

This paper introduces DiningBench, a new benchmark that uses multi-view images and finely categorized dietary data to comprehensively evaluate AI models' ability to recognize dishes, estimate nutrition, and answer food-related questions, finding that current models still fall clearly short in fine-grained visual discrimination and precise nutritional reasoning.

Source: arXiv: 2604.10425