arXiv submission date: 2026-06-03
📄 Abstract - NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models (VLMs) for semi-automated, expert-verified dataset construction. NextMotionQA features three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction. Each task is systematically structured across three core semantic axes and stratified into three task complexity levels. Our extensive evaluation of twelve representative VLMs uncovers critical capability gaps and weaknesses that remain invisible under conventional, single-task evaluations. In a complementary direction, recent work has begun using VLMs as judges for text-to-motion evaluation; we ask whether they show the same degradation under harder tasks. We find that VLMs align strongly with expert ratings on coarse criteria (Cohen's κ = 0.70) but break down on fine-grained, part-level judgment (κ = 0.10), validating the paradigm in its strong regime while clarifying its limits.
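The agreement figures in the abstract are reported as Cohen's κ, which compares the observed agreement between two raters with the agreement expected by chance: κ = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of the unweighted form, assuming categorical labels from one expert and one VLM judge. The abstract does not say which κ variant or rating scale the authors used, and the function name `cohen_kappa` and the toy labels are hypothetical, not taken from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items on which both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labelled independently according to their own marginal frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Kappa is undefined when both raters use one identical label throughout;
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy ratings (illustrative only, not data from the paper).
expert = ["good", "good", "bad", "bad", "good", "bad", "good", "bad", "good", "bad"]
vlm = ["good", "good", "bad", "bad", "good", "bad", "good", "bad", "bad", "good"]
kappa = cohen_kappa(expert, vlm)
print(f"kappa = {kappa:.2f}")
```

In this toy case the raters agree on 8 of 10 items (p_o = 0.8) and chance agreement is p_e = 0.5, so κ is approximately 0.6. A value near 0.10, as reported for part-level judgment, means the agreement is only slightly better than chance.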

Top-level tags: benchmark, model evaluation, multi-modal
Detailed tags: human motion understanding, vision-language models, question answering, video captioning, error correction

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models


1️⃣ One-sentence summary

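This paper introduces the NextMotionQA benchmark, which systematically evaluates how well vision-language models understand human motion through three tasks (multiple-choice question answering, video captioning, and fine-grained error correction), and shows that models perform acceptably on simple tasks but fail badly at fine-grained, part-level judgment.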

Source: arXiv: 2606.04773