arXiv submission date: 2026-01-10
📄 Abstract - BabyVision: Visual Reasoning Beyond Language

While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncover a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess the core visual abilities of MLLMs independent of linguistic knowledge. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines: Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show that, despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress on BabyVision represents a step toward human-level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing BabyVision-Gen and an automatic evaluation toolkit. Our code and benchmark data are released at this https URL for reproduction.

Top-level tags: multi-modal model, evaluation, benchmark
Detailed tags: visual reasoning, multimodal LLMs, cognitive gap, evaluation framework, human baseline

BabyVision: Visual Reasoning Beyond Language


1️⃣ One-Sentence Summary

By constructing a benchmark called BabyVision, this paper reveals that today's state-of-the-art multimodal large language models, when tested on basic visual reasoning that requires no linguistic assistance, fall far short of even three-year-old children, indicating a fundamental deficiency in their core visual perception.

Source: arXiv:2601.06521