arXiv submission date: 2026-05-19
📄 Abstract - FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, and repeated exchange of intermediate states further increases memory access overhead. Parallel speculative decoding addresses this limitation by performing drafting and verification within a single target forward pass, allowing future drafts to be prepared while current candidates are being verified. Although effective at small batch sizes, existing parallel speculative decoding methods either require costly continual pretraining with quality degradation or suffer from low acceptance rates. More importantly, this paradigm inherently suffers from uncertainty in both the bonus token and the accepted length, leading to draft verification mismatch and causing throughput gains to collapse at large batch sizes. To address these limitations, we introduce FlexDraft, a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs. (1) Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution and produce high quality drafts with minimal trainable parameters. (2) Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft verification mismatch caused by bonus token uncertainty. (3) Flex Decoding dynamically switches between parallel draft and verify at small batch sizes and sequential draft then verify at large batch sizes, and adjusts verification length based on draft confidence to eliminate redundant computation.
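The abstract describes Flex Decoding only at a high level, so the following is a minimal sketch of the control logic rather than the authors' implementation. It assumes standard greedy verification (accept the longest draft prefix that matches the target's predictions, then take the target's token at that position as the bonus token). The names `FlexConfig`, `choose_mode`, `verify_length`, and `greedy_verify`, along with every threshold value, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FlexConfig:
    # All values below are illustrative assumptions, not from the paper.
    batch_threshold: int = 8       # at or below this, draft and verify in parallel
    confidence_floor: float = 0.5  # stop verifying once draft confidence drops below this
    max_verify_len: int = 8        # upper bound on draft tokens sent for verification


def choose_mode(batch_size: int, cfg: FlexConfig) -> str:
    """Pick parallel draft-and-verify for small batches, sequential otherwise."""
    return "parallel" if batch_size <= cfg.batch_threshold else "sequential"


def verify_length(draft_confidences: List[float], cfg: FlexConfig) -> int:
    """Count the leading draft tokens whose confidence stays at or above the floor."""
    n = 0
    for conf in draft_confidences[: cfg.max_verify_len]:
        if conf < cfg.confidence_floor:
            break
        n += 1
    return n


def greedy_verify(draft: List[int], target_preds: List[int]) -> Tuple[int, int]:
    """Greedy verification of a draft against the target model's predictions.

    target_preds[i] is the target's token for position i given the accepted
    prefix plus draft[:i], so it holds len(draft) + 1 entries.
    Returns (accepted_length, bonus_token).
    """
    if len(target_preds) != len(draft) + 1:
        raise ValueError("target_preds must have len(draft) + 1 entries")
    accepted = 0
    for drafted, predicted in zip(draft, target_preds):
        if drafted != predicted:
            break
        accepted += 1
    # The bonus token is the target's own prediction right after the accepted prefix.
    return accepted, target_preds[accepted]
```

Neither the accepted length nor the bonus token is known until `greedy_verify` finishes, which is the uncertainty the abstract says parallel drafting has to cope with: drafts for the next step are prepared before these two values are resolved.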

Top-level tags: llm, model training, model evaluation
Detailed tags: speculative decoding, attention tuning, draft verification, parallel decoding, calibration

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration


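1️⃣ One-sentence summary

This paper proposes FlexDraft, a lossless speculative decoding framework that tunes only the attention projectors of the final few layers, calibrates draft logits using the resolved bonus token, and adaptively switches between parallel and sequential decoding, speeding up large language model inference across batch sizes without costly continual pretraining and without degrading generation quality.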

Source: arXiv: 2605.20022