DFlash: Block Diffusion for Flash Speculative Decoding
1️⃣ One-sentence summary
This paper proposes DFlash, a method that combines the parallel generation capability of diffusion models with the speculative decoding framework: a lightweight block diffusion model rapidly generates draft text, which the large language model then verifies in parallel, accelerating LLM inference by more than 6x with no loss in generation quality.
Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
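The draft-then-verify loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy target model, the toy block drafter, the deliberate mismatch on the last draft token, and the block size of 4 are all assumptions for demonstration, and the "parallel" verification is simulated sequentially.

```python
# Hedged sketch of speculative decoding with a block drafter, the loop
# DFlash builds on. All models here are toy stand-ins (assumptions).

def toy_target_next(ctx):
    # Toy "target LLM": deterministically maps a context to the next token.
    return (ctx[-1] * 3 + 1) % 50

def toy_block_draft(ctx, block_size=4):
    # Toy "block diffusion drafter": proposes a whole block of draft
    # tokens at once. Here it imitates the target but errs on the
    # final token, to exercise the rejection path.
    block, cur = [], list(ctx)
    for i in range(block_size):
        tok = toy_target_next(cur)
        if i == block_size - 1:
            tok = (tok + 1) % 50  # deliberate mismatch on the last draft token
        block.append(tok)
        cur.append(tok)
    return block

def speculative_step(ctx, block_size=4):
    # Draft a block, verify it against the target, accept the longest
    # matching prefix, and append one corrected token on the first reject.
    draft = toy_block_draft(ctx, block_size)
    accepted, cur = [], list(ctx)
    for tok in draft:
        tgt = toy_target_next(cur)
        if tok == tgt:
            accepted.append(tok)
            cur.append(tok)
        else:
            accepted.append(tgt)  # target's token replaces the rejected draft
            break
    return ctx + accepted

print(speculative_step([7]))  # three drafts accepted plus one corrected token
```

One verified step here emits four tokens for a single (simulated) target pass, which is the source of the speedup: the expensive model is invoked once per block rather than once per token, and output quality matches the target exactly because every emitted token is target-approved.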
Source: arXiv: 2602.06036