arXiv submission date: 2026-02-25
📄 Abstract - Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods rely on a calibrated scalar evaluator of the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal about improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide both the allocation of the comparison budget toward plausible optima via Double Thompson Sampling and the selection of high-quality parents for generating improved candidates. We evaluate Duel-Evolve on MathBench, where it achieves accuracy 20 percentage points higher than existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by more than 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function. These results show that pairwise self-preferences provide a strong optimization signal for test-time improvement over large, discrete output spaces.
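The core loop described above can be sketched in code: maintain pairwise win counts, place a Beta posterior over each head-to-head win probability (a simple Bayesian Bradley-Terry-style surrogate), and use Double-Thompson-Sampling-style selection to decide which pair of candidates to compare next. This is a minimal illustrative sketch, not the paper's implementation; the function names, the Copeland-score tie-breaking, and the noisy preference oracle are all assumptions for the demo.

```python
import random

def sample_beta(wins, losses, rng):
    """Draw from Beta(wins + 1, losses + 1) via two gamma samples."""
    a = rng.gammavariate(wins + 1, 1.0)
    b = rng.gammavariate(losses + 1, 1.0)
    return a / (a + b)

def select_duel(wins, rng):
    """Choose two candidates to compare next (Double-TS-style sketch)."""
    n = len(wins)
    # Sample a full preference matrix from the per-pair Beta posteriors.
    theta = [[sample_beta(wins[i][j], wins[j][i], rng) if i != j else 0.5
              for j in range(n)] for i in range(n)]
    # First arm: maximize the sampled Copeland score
    # (number of opponents it is sampled to beat).
    copeland = [sum(theta[i][j] > 0.5 for j in range(n) if j != i)
                for i in range(n)]
    first = max(range(n), key=lambda i: copeland[i])
    # Second arm: strongest challenger to `first` under fresh samples.
    second = max((j for j in range(n) if j != first),
                 key=lambda j: sample_beta(wins[j][first], wins[first][j], rng))
    return first, second

def run(qualities, budget, seed=0):
    """Spend `budget` noisy pairwise comparisons; return the win matrix.
    In Duel-Evolve the comparisons would come from the LLM itself; here a
    synthetic Bradley-Terry oracle with known qualities stands in for it."""
    rng = random.Random(seed)
    n = len(qualities)
    wins = [[0] * n for _ in range(n)]
    for _ in range(budget):
        i, j = select_duel(wins, rng)
        p_i = qualities[i] / (qualities[i] + qualities[j])
        if rng.random() < p_i:
            wins[i][j] += 1
        else:
            wins[j][i] += 1
    return wins

wins = run([1.0, 2.0, 8.0, 1.5], budget=300)
best = max(range(4), key=lambda i: sum(wins[i]))
print(best)  # the highest-quality candidate should accumulate the most wins
```

In the full algorithm, the uncertainty-aware quality estimates from this posterior would also drive parent selection for generating new candidates; the sketch covers only the comparison-budget allocation side.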

Top-level tags: llm, model evaluation, agents
Detailed tags: test-time optimization, evolutionary algorithm, pairwise preference, self-improvement, bradley-terry model

Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences


1️⃣ One-Sentence Summary

This paper proposes a new method called Duel-Evolve that lets a large language model iteratively optimize its output at test time by comparing the relative quality of its own candidate answers, with no external scoring or reward model, yielding significant performance gains on tasks such as math and code generation.

Source: arXiv:2602.21585