Arbitrage: Efficient Reasoning via Advantage-Aware Speculation
1️⃣ One-Sentence Summary
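This paper proposes Arbitrage, a method that uses a lightweight router to decide, for each reasoning step, whether to generate it with the fast but less accurate draft model or with the accurate but slower target model, substantially speeding up generation while preserving the reasoning accuracy of large language models.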
Modern Large Language Models achieve impressive reasoning capabilities with long chains of thought, but they incur substantial computational cost during inference, which motivates techniques that improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, traditional token-level Speculative Decoding struggles in reasoning tasks, because token mismatches between semantically equivalent steps cause unnecessary rejections. Recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, but existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
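The routing idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names (`arbitrage_generate`, `oracle_pick`), the callable signatures, the `threshold` parameter, and the toy models are all hypothetical and introduced only for illustration. The sketch assumes the router returns a score estimating how likely the target model is to produce a meaningfully better step than the draft, and that the target is called only when that score exceeds a threshold.

```python
from typing import Callable, List, Tuple

Step = str
Context = List[Step]


def oracle_pick(draft: Step, target: Step, quality: Callable[[Step], float]) -> Step:
    """Idealized oracle (hypothetical helper): keep whichever step scores higher.

    It needs both candidates, so it always pays for the target model.
    """
    return target if quality(target) > quality(draft) else draft


def arbitrage_generate(
    draft_step: Callable[[Context], Step],
    target_step: Callable[[Context], Step],
    router: Callable[[Context, Step], float],
    is_done: Callable[[Context], bool],
    threshold: float = 0.5,  # assumed decision rule, not taken from the paper
    max_steps: int = 16,
) -> Tuple[Context, int]:
    """Step-level generation where a router decides when to call the target."""
    steps: Context = []
    target_calls = 0
    while not is_done(steps) and len(steps) < max_steps:
        candidate = draft_step(steps)  # the cheap proposal is always produced
        # Escalate only when the router predicts a meaningful target advantage.
        if router(steps, candidate) > threshold:
            candidate = target_step(steps)
            target_calls += 1
        steps.append(candidate)
    return steps, target_calls


# Toy stand-ins for the two models and the router (illustrative only).
def toy_draft(ctx: Context) -> Step:
    return f"draft-{len(ctx)}"


def toy_target(ctx: Context) -> Step:
    return f"target-{len(ctx)}"


def toy_router(ctx: Context, candidate: Step) -> float:
    # Pretend odd-numbered steps are "hard" and benefit from the target.
    return 0.9 if len(ctx) % 2 == 1 else 0.1


def toy_done(ctx: Context) -> bool:
    return len(ctx) >= 4


if __name__ == "__main__":
    steps, calls = arbitrage_generate(toy_draft, toy_target, toy_router, toy_done)
    print(steps, calls)
```

In this toy run the target model is invoked for only two of the four steps. The oracle, by contrast, must generate both candidates at every step, which is why a learned router that approximates its choice from the draft step alone can save target compute.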
Source: arXiv: 2512.05033