RelayLLM: Efficient Reasoning via Collaborative Decoding
1️⃣ One-Sentence Summary
This paper proposes a new method called RelayLLM, in which a small language model generates text like a runner in a relay race: it dynamically calls on a large language model for help only when it encounters critical difficulties. At very low cost (invoking the LLM for just 1.07% of tokens), it achieves reasoning performance close to that of the large model, sharply reducing computational overhead.
Deploying Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
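To make the relay mechanism concrete, below is a minimal illustrative sketch of token-level collaborative decoding under stated assumptions. The names `relay_decode`, `slm_next_token`, `llm_next_token`, and the `CALL_LLM` marker are hypothetical stand-ins, not the paper's actual API; the paper's special command token, invocation granularity, and training procedure (warm-up plus GRPO) are not reproduced here.

```python
# Illustrative sketch of RelayLLM-style token-level collaborative decoding.
# Assumption: slm_next_token / llm_next_token are callables that map the
# current text to the next token string (stand-ins for real decoding steps).

CALL_LLM = "<call_llm>"  # hypothetical special command token emitted by the SLM
EOS = "<eos>"            # hypothetical end-of-sequence token

def relay_decode(prompt, slm_next_token, llm_next_token, max_tokens=256):
    """Generate with the SLM, relaying individual critical tokens to the LLM.

    Returns the generated text and the fraction of tokens produced by the LLM
    (the quantity reported as 1.07% in the paper's results).
    """
    tokens, llm_calls = [], 0
    text = prompt
    while len(tokens) < max_tokens:
        tok = slm_next_token(text)
        if tok == CALL_LLM:
            # The SLM flagged a critical step: the LLM produces this one
            # token, then control relays back to the SLM next iteration.
            tok = llm_next_token(text)
            llm_calls += 1
        if tok == EOS:
            break
        tokens.append(tok)
        text += tok
    return "".join(tokens), llm_calls / max(len(tokens), 1)
```

In this sketch, cost savings come from the SLM driving the loop: the LLM is consulted only when the SLM itself emits the command token, rather than a router deciding per query.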
Source: arXiv: 2601.05167