Where Norms and References Collide: Evaluating LLMs on Normative Reasoning
1️⃣ One-Sentence Summary
Using a diagnostic testbed called SNIC, this study finds that even today's strongest large language models still fall clearly short on reference-resolution tasks that require combining physical and social context to understand implicit behavioral norms, revealing a key gap for deploying them in socially situated settings such as embodied AI.
Embodied agents, such as robots, will need to interact in situated environments where successful communication often depends on reasoning over social norms: shared expectations that constrain what actions are appropriate in context. A key capability in such settings is norm-based reference resolution (NBRR), where interpreting referential expressions requires inferring implicit normative expectations grounded in physical and social context. Yet it remains unclear whether Large Language Models (LLMs) can support this kind of reasoning. In this work, we introduce SNIC (Situated Norms in Context), a human-validated diagnostic testbed designed to probe how well state-of-the-art LLMs can extract and utilize normative principles relevant to NBRR. SNIC emphasizes physically grounded norms that arise in everyday tasks such as cleaning, tidying, and serving. Across a range of controlled evaluations, we find that even the strongest LLMs struggle to consistently identify and apply social norms, particularly when norms are implicit, underspecified, or in conflict. These findings reveal a blind spot in current LLMs and highlight a key challenge for deploying language-based systems in socially situated, embodied settings.
Source: arXiv:2602.02975