Unmasking the Factual-Conceptual Gap in Persian Language Models
1️⃣ One-Sentence Summary
By introducing a new benchmark called DivanBench, designed to test how well Persian large language models understand complex social norms such as superstitions and customs, this paper finds that although the models can memorize cultural facts, they struggle to reason with them in concrete situations, exposing a severe "acquiescence bias" and a shortfall in applying factual knowledge.
While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs: arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continued Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model's ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.
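To make the paired-scenario failure mode concrete, here is a minimal Python sketch of how such a task could be scored: a model with acquiescence bias accepts the norm-compliant scenario but also fails to reject the clear violation, answering "yes" to both. Everything here (the `ScenarioPair` schema, the prompt wording, the demo data) is an illustrative assumption, not DivanBench's actual implementation.

```python
# Hypothetical sketch of paired scenario verification and an
# acquiescence-bias score. Not the authors' released code.
from dataclasses import dataclass

@dataclass
class ScenarioPair:
    norm: str        # the superstition or custom being tested
    compliant: str   # scenario that follows the norm
    violating: str   # scenario that clearly breaks the norm

def judges_appropriate(model, scenario: str) -> bool:
    """Ask the model a yes/no appropriateness question (model is a stub here)."""
    prompt = ("Is the following behavior culturally appropriate? "
              f"Answer yes or no.\n{scenario}")
    return model(prompt).strip().lower().startswith("yes")

def acquiescence_bias(model, pairs: list[ScenarioPair]) -> float:
    """Fraction of pairs where the model accepts the compliant scenario
    but also accepts the violating one, i.e. it agrees with everything."""
    biased = sum(
        1 for p in pairs
        if judges_appropriate(model, p.compliant)
        and judges_appropriate(model, p.violating)
    )
    return biased / len(pairs)

if __name__ == "__main__":
    # A trivially agreeable stand-in model: always answers "yes",
    # so it scores maximal acquiescence bias on any pair set.
    always_yes = lambda prompt: "yes"
    demo = [ScenarioPair(
        norm="Avoid handing scissors directly to another person.",
        compliant="Ali places the scissors on the table for Sara to pick up.",
        violating="Ali hands the open scissors straight into Sara's palm.",
    )]
    print(acquiescence_bias(always_yes, demo))  # -> 1.0
```

Scoring the pair jointly, rather than each scenario in isolation, is what separates genuine norm understanding from blanket agreement: a model that answers "yes" uniformly gets the compliant half right for free.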
Source: arXiv:2602.17623