Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
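1️⃣ One-sentence summary
This study finds that when an AI agent calls tools to complete tasks, brief reasoning (about 8 to 32 tokens) substantially improves accuracy while overly long reasoning hurts it, and on that basis proposes a structured brief-reasoning method that keeps the model from inventing functions that do not exist.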
How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0 to 512 tokens) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) raises accuracy from 44.0% with direct answers to 64.0%, a 45% relative improvement, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. Writing d for the CoT token budget, at d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8 to 16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]", forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
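To make the FR-CoT template concrete, the following is a minimal Python sketch of how a "Function: [name] / Key args: [...]" reasoning line might be requested and then checked against the candidate set. It is an illustration under stated assumptions, not the authors' implementation: the helper names `build_fr_cot_prompt` and `parse_fr_cot`, the prompt wording, and the regular expression are all hypothetical, and the abstract does not say whether validity is enforced by prompting, constrained decoding, or post-hoc checking as done here.

```python
import re

# Hypothetical pattern for the template quoted in the abstract:
#   Function: [name] / Key args: [...]
_FR_COT_PATTERN = re.compile(
    r"Function:\s*(?P<name>[\w.]+)\s*/\s*Key args:\s*(?P<args>.*)"
)


def build_fr_cot_prompt(question, candidate_functions):
    """Build a prompt asking for one templated reasoning line.

    The wording is an assumption; only the template string comes from the abstract.
    """
    names = ", ".join(sorted(candidate_functions))
    return (
        f"Question: {question}\n"
        f"Available functions: {names}\n"
        "Before calling, fill in exactly one line:\n"
        "Function: [name] / Key args: [...]\n"
    )


def parse_fr_cot(reasoning, candidate_functions):
    """Return (function_name, key_args), or None if the line is unusable.

    None is returned when the template is not followed or when the named
    function is not in the candidate set (a hallucinated function).
    """
    match = _FR_COT_PATTERN.search(reasoning)
    if match is None:
        return None
    name = match.group("name")
    if name not in candidate_functions:
        return None
    return name, match.group("args").strip()


if __name__ == "__main__":
    candidates = {"get_weather", "get_time"}
    print(build_fr_cot_prompt("What is the weather in Paris?", candidates))
    print(parse_fr_cot("Function: get_weather / Key args: city=Paris", candidates))
    print(parse_fr_cot("Function: fetch_forecast / Key args: city=Paris", candidates))
```

In this sketch, a reasoning line naming a function outside the candidate set is rejected rather than executed, which is one simple way to keep hallucinated calls from reaching the tool layer; a caller could then retry or fall back to a direct answer.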
Source: arXiv:2604.02155