More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search
1️⃣ One-sentence summary
This paper finds that, during large language model inference, blindly increasing the beam search width (i.e., considering more candidate paths) can actually reduce output quality. The root cause is that scorer noise induces a systematic overestimation bias, and the key factor determining the optimal search width is the scorer's signal-to-noise ratio.
Wider beam search should improve LLM reasoning, but when should you stop widening? Prior work on beam width selection has focused on inference efficiency \citep{qin2025dsbd, freitag2017beam}, without analyzing whether wider search can \emph{hurt} output quality. We present an analysis, grounded in Extreme Value Theory, that answers this question. Beam selection over noisy scorer outputs introduces a systematic overestimation bias that grows with the candidate pool size, and we derive a maximum useful beam width $\hat{k}$ beyond which search degrades performance. This critical width depends on the signal-to-noise ratio of the scorer: $\hat{k}$ grows exponentially with $(\Delta/\sigma)^2$, where $\Delta > 0$ is the quality advantage of correct paths over incorrect ones and $\sigma$ is the scorer noise. We validate this theory by comparing perplexity-guided and PRM-guided beam search across three 7B-parameter models and ten domains on MR-BEN (5,975 questions). Perplexity scoring, with its high noise, yields $\hat{k} = 1$: search provides no benefit at any width tested. PRM scoring, with lower noise, yields $\hat{k} \geq 4$, with gains of up to 8.9 percentage points. The same model, the same algorithm, but different scorers place $\hat{k}$ at opposite ends of the beam width range. Our analysis identifies the scorer's signal-to-noise ratio as the key quantity governing beam width selection, and we propose diagnostic indicators for choosing the beam width in practice.
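The abstract's core mechanism can be illustrated with a small Monte-Carlo sketch. Below is a minimal simulation, assuming Gaussian scorer noise; the function names, parameters, and setup are illustrative and not taken from the paper. The first function shows the Extreme Value Theory effect: taking the max of $k$ noisy scores systematically overestimates true quality, with bias growing roughly like $\sigma\sqrt{2\ln k}$. The second shows why this can hurt selection: as the pool of incorrect candidates grows, the chance that one of them gets a lucky score above the correct candidate (true advantage $\Delta$) rises, and the decay rate depends on $\Delta/\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TRIALS = 20_000

def selection_bias(k, sigma, n_trials=N_TRIALS):
    """Average overestimation from taking the max of k noisy scores.

    Every candidate's true quality is 0; the observed score is pure
    noise. The selected (max) score overestimates true quality, and
    the bias grows with pool size k -- for Gaussian noise, roughly
    sigma * sqrt(2 ln k), a standard Extreme Value Theory result.
    """
    noise = rng.normal(0.0, sigma, size=(n_trials, k))
    return noise.max(axis=1).mean()

def pick_correct_rate(k, delta, sigma, n_trials=N_TRIALS):
    """Fraction of trials where the single correct candidate (true
    quality delta) outscores k-1 incorrect candidates (true quality 0)
    under i.i.d. Gaussian scorer noise of scale sigma.
    """
    if k == 1:
        return 1.0  # no competitors
    correct = delta + rng.normal(0.0, sigma, size=n_trials)
    wrong = rng.normal(0.0, sigma, size=(n_trials, k - 1))
    return (correct > wrong.max(axis=1)).mean()

if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(f"k={k:2d}"
              f"  max-score bias={selection_bias(k, sigma=1.0):+.3f}"
              f"  low-noise pick rate={pick_correct_rate(k, 1.0, 0.5):.3f}"
              f"  high-noise pick rate={pick_correct_rate(k, 1.0, 2.0):.3f}")
```

With a low-noise scorer (large $\Delta/\sigma$), the pick rate stays high as $k$ grows, so widening the pool is safe; with a high-noise scorer it drops quickly, mirroring the paper's finding that perplexity scoring yields $\hat{k}=1$ while PRM scoring supports $\hat{k}\geq 4$.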
Source: arXiv: 2603.15377