菜单

关于 🐙 GitHub
arXiv 提交日期: 2026-06-08
📄 Abstract - ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.

顶级标签: multi-modal benchmark machine learning
详细标签: visual question answering cultural reasoning heritage understanding bilingual vqa vision-language models 或 搜索:

ChinaHeritaQA:面向中国世界遗产的文化视觉问答数据集 / ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China


1️⃣ 一句话总结

该论文构建了一个包含中国世界遗产图像和双语问答对的多模态基准数据集,评估了视觉语言模型在文化推理上的能力,发现现有模型虽擅长视觉识别,但在理解历史、朝代等深层文化知识方面仍有明显不足。

源自 arXiv: 2606.08959