arXiv submission date: 2026-01-25
📄 Abstract - AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A assessing levels of understanding from surface content to context and emotion to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants using this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and struggle to think in context and in culture compared to surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally beyond the surface of what they hear and see. Project page: this http URL
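As a purely illustrative aid, the sketch below shows what one benchmark record and a simple accuracy loop might look like, assuming a multiple-choice Q&A format and item fields named after the metadata mentioned in the abstract (year, transcript, summary, sensitivity). The field names, the `predict` callable, and the answer format are assumptions for illustration, not the authors' released schema.

```python
# Illustrative sketch only: field names and the multiple-choice answer format
# are assumptions based on the abstract, not the authors' released data schema.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AVMemeItem:
    clip_path: str        # path to the audio-visual clip
    modality: str         # e.g. "speech", "song", "music", "sound effect"
    year: int             # original year of the meme
    transcript: str       # transcript; empty for textless clips
    summary: str          # short human-written summary
    sensitivity: str      # sensitivity label
    question: str         # the unique question paired with this meme
    choices: List[str]    # candidate answers
    answer_idx: int       # index of the correct choice
    level: str            # "surface", "context/emotion", or "usage/knowledge"

def accuracy(items: List[AVMemeItem],
             predict: Callable[[AVMemeItem], int]) -> float:
    """Fraction of items where the model's chosen index matches the key.

    `predict` is any callable mapping an item to a choice index, e.g. a
    wrapper around an MLLM API call (hypothetical here).
    """
    correct = sum(1 for it in items if predict(it) == it.answer_idx)
    return correct / len(items) if items else 0.0
```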

Top-level tags: multi-modal llm benchmark
Detailed tags: multimodal evaluation, cultural reasoning, audio-visual understanding, meme comprehension, contextual knowledge

AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking


1️⃣ One-Sentence Summary

This paper introduces AVMeme Exam, a benchmark that evaluates AI models' ability to understand popular Internet audio-visual clips (such as music and sound effects) within their cultural contexts, and finds that current state-of-the-art multimodal large models fall clearly short at understanding textless audio and at thinking in cultural context.

Source: arXiv 2601.17645