arXiv submission date: 2026-04-29
📄 Abstract - Cross-Lingual Response Consistency in Large Language Models: An ILR-Informed Evaluation of Claude Across Six Languages

This paper introduces a systematic evaluation framework grounded in the Interagency Language Roundtable (ILR) Skill Level Descriptions and applies it to Claude (Sonnet 4.6) across six languages: English, French, Romanian, Spanish, Italian, and German. We administer a battery of 12 semantically equivalent prompt clusters spanning ILR complexity levels 1 through 3+, collect 216 responses (12 prompts, 6 languages, 3 runs), and analyze outputs through a two-layer methodology combining automated quantitative metrics with expert ILR qualitative assessment. Quantitative analysis reveals that French responses are approximately 30% longer than German responses on identical prompts, and that creative and affective clusters show the highest cross-lingual surface divergence. Qualitative analysis, conducted by a six-language professional with 12 years of ILR/OPI assessment experience, identifies five cross-lingual variation patterns: systematic differences in pragmatic disambiguation strategies, aesthetic and literary tradition divergence in creative output, language-internal technical terminology norms, cultural calibration gaps evidenced by the absence of culture-specific content in favor of culturally neutralized templates, and language-specific institutional referral behavior in emotional support responses. We argue that ILR-informed expert judgment applied to LLM outputs constitutes a novel and underreported evaluation methodology that complements purely computational benchmarks, and that cross-lingual output variation in Claude is interpretable, domain-dependent, and consequential for equitable multilingual AI deployment.
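The abstract's quantitative layer compares surface metrics such as response length across languages (e.g. French responses roughly 30% longer than German on identical prompts). A minimal sketch of that kind of length-divergence computation is shown below; the response lengths and the two-language comparison are hypothetical placeholders, not the authors' data or code.

```python
# Illustrative sketch (not the paper's implementation): measuring
# cross-lingual response-length divergence of the kind the abstract
# reports. Lengths below are hypothetical placeholder values.
from statistics import mean

# responses[language] = response lengths in characters, one entry per
# (prompt, run) pair; the full study would have 12 prompts x 3 runs.
responses = {
    "fr": [1300, 1250, 1400],  # placeholder French response lengths
    "de": [1000, 950, 1100],   # placeholder German response lengths
}

def mean_length(lang: str) -> float:
    """Average response length for one language across all samples."""
    return mean(responses[lang])

# Relative divergence: how much longer French output is than German.
ratio = mean_length("fr") / mean_length("de")
print(f"French/German mean length ratio: {ratio:.2f}")
```

With the placeholder values above, the ratio lands near the ~1.3x disparity the abstract describes; in the actual study this metric would be computed over all 216 collected responses.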

Top-level tags: llm, natural language processing, model evaluation
Detailed tags: cross-lingual response consistency, ilr framework, claude evaluation, multilingual

Cross-Lingual Response Consistency in Large Language Models: An ILR-Informed Evaluation of Claude Across Six Languages


1️⃣ One-Sentence Summary

Grounded in the Interagency Language Roundtable (ILR) Skill Level Descriptions, this paper systematically evaluates the response consistency of Claude across six languages (English, French, Romanian, Spanish, Italian, and German), finding significant and systematic cross-lingual differences in response length, creative expression, politeness strategies, technical terminology, and cultural calibration; it argues that this cross-lingual output variation is interpretable, domain-dependent, and consequential for equitable multilingual AI deployment.

Source: arXiv:2604.27137