📄 Abstract - World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning on a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.
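The two behaviors the abstract reports, an accuracy drop when a background is added and inconsistent predictions for the same food across contexts, can be illustrated with a minimal sketch. The record format, the toy data, and the function names below are assumptions made for illustration; they are not the CultureMix data schema or the authors' metric definitions. The abstract also does not say whether the 14% drop is absolute or relative, so the sketch simply takes an absolute difference.

```python
from collections import defaultdict

# Hypothetical records: (food_id, subtask, predicted_label, gold_label).
# The values are toy data, not results from the paper.
records = [
    ("kimchi", "food-only", "Korea", "Korea"),
    ("kimchi", "food+background", "Japan", "Korea"),
    ("taco", "food-only", "Mexico", "Mexico"),
    ("taco", "food+background", "Mexico", "Mexico"),
]


def accuracy_by_subtask(rows):
    """Return the fraction of correct predictions for each subtask."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for _food_id, subtask, pred, gold in rows:
        total[subtask] += 1
        correct[subtask] += int(pred == gold)
    return {subtask: correct[subtask] / total[subtask] for subtask in total}


def cross_context_consistency(rows):
    """Return the fraction of foods that get one single prediction in every context."""
    predictions = defaultdict(set)
    for food_id, _subtask, pred, _gold in rows:
        predictions[food_id].add(pred)
    return sum(len(p) == 1 for p in predictions.values()) / len(predictions)


acc = accuracy_by_subtask(records)
# Absolute accuracy drop from the food-only baseline to food+background.
background_drop = acc["food-only"] - acc["food+background"]
consistency = cross_context_consistency(records)
print(acc, background_drop, consistency)
```

In this toy data the model relabels one food once a background is present, so both the accuracy gap and the consistency score move; a model that ignores the background would show a drop of zero and a consistency of one.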

Top-level tags: multi-modal, model evaluation, natural language processing
Detailed tags: vision-language models, cultural understanding, visual question answering, benchmark, robustness

World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models


1️⃣ One-sentence summary

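This paper shows that when elements from different cultures (such as food and background) appear together in a single image, current large vision-language models struggle to identify them accurately and to preserve their individual cultural identities; to study this, the authors build an evaluation benchmark called CultureMix and find that supervised fine-tuning on culture mixing data effectively improves model performance in such scenarios.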

