Semantic Content Determines Algorithmic Performance
1️⃣ One-Sentence Summary
Using a benchmark called WhatCounts, this paper finds that frontier large language models, when performing a simple algorithmic task like counting, vary in accuracy by more than 40% depending on the semantic type of the items being counted (e.g., cities vs. chemicals). This suggests the models do not truly execute algorithms but instead approximate them in a way that depends on the semantics of the input, a property that may pervade many LLM functions.
Counting should not depend on what is being counted; more generally, any algorithm's behavior should be invariant to the semantic content of its arguments. We introduce WhatCounts to test this property in isolation. Unlike prior work that conflates semantic sensitivity with reasoning complexity or prompt variation, WhatCounts is atomic: for each semantic type, count the items in an unambiguous, delimited list with no duplicates, distractors, or reasoning steps. Frontier LLMs show over 40% accuracy variation depending solely on what is being counted: cities versus chemicals, names versus symbols. Controlled ablations rule out confounds. The gap is semantic, and it shifts unpredictably with small amounts of unrelated fine-tuning. LLMs do not implement algorithms; they approximate them, and the approximation is argument-dependent. As we show with an agentic example, this has implications beyond counting: any LLM function may carry hidden dependencies on the meaning of its inputs.
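To make the protocol concrete, here is a minimal Python sketch of a WhatCounts-style probe: it generates delimited counting lists that differ only in the semantic type of the items and measures per-type accuracy. The item pools, prompt wording, and `query_model` interface are assumptions for illustration, not the paper's actual benchmark data or evaluation harness.

```python
import random

# Hypothetical item pools; the paper's actual WhatCounts categories
# (cities, chemicals, names, symbols, ...) are only partially named
# in the abstract, so these are illustrative stand-ins.
SEMANTIC_TYPES = {
    "cities": ["Paris", "Lagos", "Osaka", "Lima", "Oslo", "Quito", "Dakar", "Perth"],
    "chemicals": ["ethanol", "benzene", "glucose", "ammonia", "acetone", "toluene", "urea", "xylene"],
    "symbols": ["@", "#", "&", "%", "$", "*", "+", "="],
}

def make_probe(semantic_type: str, n: int, rng: random.Random) -> tuple[str, int]:
    """Build one atomic counting prompt: an unambiguous, comma-delimited
    list with no duplicates or distractors, so only the semantic type of
    the items varies between conditions."""
    items = rng.sample(SEMANTIC_TYPES[semantic_type], n)
    prompt = (
        "How many items are in the following comma-separated list? "
        "Answer with a single integer.\nList: " + ", ".join(items)
    )
    return prompt, n

def accuracy_by_type(query_model, trials: int = 50, seed: int = 0) -> dict[str, float]:
    """Estimate per-type counting accuracy for `query_model`, a callable
    mapping a prompt string to an integer answer (e.g., an LLM wrapper)."""
    rng = random.Random(seed)
    results = {}
    for semantic_type, pool in SEMANTIC_TYPES.items():
        correct = 0
        for _ in range(trials):
            n = rng.randint(3, len(pool))
            prompt, truth = make_probe(semantic_type, n, rng)
            if query_model(prompt) == truth:
                correct += 1
        results[semantic_type] = correct / trials
    return results

if __name__ == "__main__":
    # Deterministic stand-in that parses the list and counts exactly;
    # swap in a real LLM call to probe for the semantic accuracy gap.
    def stub_model(prompt: str) -> int:
        return len(prompt.split("List: ", 1)[1].split(", "))

    print(accuracy_by_type(stub_model))  # the stub scores 1.0 on every type
```

With a real model behind `query_model`, a spread across the per-type accuracies on this kind of probe would correspond to the semantic gap the abstract reports.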
Source: arXiv:2601.21618