GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts
1️⃣ One-sentence summary
This paper proposes a new method called GlimpRouter: a lightweight model generates only the first token of each reasoning step, and the "uncertainty" of that token determines whether the large model needs to be invoked to complete the step. This substantially reduces the computational cost and latency of large reasoning models while preserving accuracy.
Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
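The routing rule described in the abstract reduces to a single entropy check on the small model's first-token distribution. A minimal sketch of that decision, assuming a hypothetical `route_step` helper and an arbitrary threshold value (the paper's actual threshold and model interfaces are not given here):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route_step(first_token_probs, threshold=1.0):
    """GlimpRouter-style decision (toy sketch): the small model emits only
    the first token of a reasoning step; if the entropy of that token's
    distribution exceeds the threshold, the whole step is escalated to
    the large model. The threshold value here is an assumption."""
    return "large" if token_entropy(first_token_probs) > threshold else "small"

# Confident first token -> the small model keeps the step
print(route_step([0.9, 0.05, 0.05]))          # -> small
# Uncertain first token (uniform over 4 tokens) -> escalate
print(route_step([0.25, 0.25, 0.25, 0.25]))   # -> large
```

In a real deployment, `first_token_probs` would come from the lightweight model's softmax over its vocabulary at the start of each step, so the routing decision costs only one forward pass of the small model per step rather than a full-step generation or post-hoc verification.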
Source: arXiv:2601.05110