ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models
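1️⃣ One-Sentence Summary
This paper proposes ReasonAlloc, a training-free method for long chain-of-thought reasoning that allocates the key-value cache budget dynamically at two levels: offline across layers and online across heads. It addresses the inefficiency of conventional uniform eviction strategies during reasoning and significantly improves mathematical reasoning performance at small budgets.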
Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern which we call "Reasoning Wave", while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME 2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128 to 512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.
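To make the two-level idea concrete, here is a minimal sketch of hierarchical budget allocation. It is not the authors' implementation: the abstract does not specify the allocation rule, so the proportional split with a per-item floor, the function names (`proportional_split`, `allocate_layers`, `allocate_heads`), the layer profile values, and the head utility scores are all assumptions chosen for illustration.

```python
from typing import List


def proportional_split(total: int, weights: List[float], floor: int = 0) -> List[int]:
    """Split `total` tokens across items in proportion to `weights`.

    Every item receives at least `floor` tokens, and the result always
    sums to `total` (largest-remainder rounding). Hypothetical helper.
    """
    n = len(weights)
    if n == 0:
        raise ValueError("weights must be non-empty")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if total < floor * n:
        raise ValueError("total budget is smaller than the combined floor")

    spare = total - floor * n
    weight_sum = sum(weights)
    if weight_sum > 0:
        shares = [spare * w / weight_sum for w in weights]
    else:
        shares = [spare / n] * n  # no signal: fall back to a uniform split

    alloc = [floor + int(s) for s in shares]
    # Hand out the tokens lost to truncation, largest fractional part first.
    leftover = total - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for k in range(leftover):
        alloc[order[k % n]] += 1
    return alloc


def allocate_layers(avg_budget: int, layer_profile: List[float], floor: int) -> List[int]:
    """Offline step (assumed form): fix per-layer budgets from a demand profile.

    The sum over layers equals what a uniform policy would spend,
    avg_budget * num_layers.
    """
    return proportional_split(avg_budget * len(layer_profile), layer_profile, floor)


def allocate_heads(layer_budget: int, head_utility: List[float], floor: int) -> List[int]:
    """Online step (assumed form): redistribute one layer's budget across heads.

    `layer_budget` is the per-head average for the layer, so the layer
    spends layer_budget * num_heads tokens in total.
    """
    return proportional_split(layer_budget * len(head_utility), head_utility, floor)


# Hypothetical non-monotonic profile for a 4-layer toy model.
layer_budgets = allocate_layers(avg_budget=128, layer_profile=[1.0, 3.0, 1.0, 3.0], floor=16)
print(layer_budgets)  # [72, 184, 72, 184]

# Hypothetical utility scores for the 4 heads of layer 0 at one decoding step.
head_budgets = allocate_heads(layer_budgets[0], head_utility=[1.0, 4.0, 4.0, 1.0], floor=8)
print(head_budgets, sum(head_budgets))
```

In this sketch the total cache size matches a uniform policy at both levels; only the distribution changes. Each per-head budget would then be handed to whatever token-eviction policy is in use, which is one way to read the plug-and-play claim.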
Source: arXiv: 2606.11164