Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

📄 Abstract

Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose Prefilling-dLLM, a training-free prefill-decode disaggregation framework for dLLMs that partitions the prefix into N chunks, caches their KV representations once, and selects the top-K most relevant chunks with intra-chunk token sparsity for decoding, showing that sparse prefilling can outperform dense attention while reducing per-step complexity from quadratic in the full sequence length to quadratic only in the decode length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, and an attention kernel that parallelizes decoding over the non-contiguously cached chunk KV yields 9.1–28.0x speedup at 8K–32K contexts. We further show that beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon. Code is available at this https URL.
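
To make the chunk-and-select idea in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how chunk relevance is scored, so the mean-key dot product below is a stand-in, and the names `cache_chunks`, `chunk_score`, and `select_top_k_chunks` are hypothetical. Intra-chunk token sparsity, the per-chunk beginning-of-sequence token, and the custom attention kernel are all left out.

```python
from typing import List

Vector = List[float]


def cache_chunks(keys: List[Vector], num_chunks: int) -> List[List[Vector]]:
    """Split per-token key vectors into contiguous chunks, done once before decoding."""
    if not 0 < num_chunks <= len(keys):
        raise ValueError("num_chunks must be between 1 and len(keys)")
    base, extra = divmod(len(keys), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # Spread any remainder over the first `extra` chunks.
        size = base + (1 if i < extra else 0)
        chunks.append(keys[start:start + size])
        start += size
    return chunks


def chunk_score(chunk: List[Vector], query: Vector) -> float:
    """ASSUMPTION: relevance is the dot product of the query with the chunk's mean key."""
    mean = [sum(vec[d] for vec in chunk) / len(chunk) for d in range(len(query))]
    return sum(m * q for m, q in zip(mean, query))


def select_top_k_chunks(cache: List[List[Vector]], query: Vector, k: int) -> List[int]:
    """Return the indices of the k highest-scoring chunks, in original prefix order."""
    scores = [chunk_score(chunk, query) for chunk in cache]
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

In this sketch, a decode step would attend only to the cached keys of the chunks returned by `select_top_k_chunks`, so the prefix is never re-encoded. Returning the indices in prefix order is a design choice made here to keep positional order intact; the source does not state how selected chunks are ordered.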

1️⃣ One-Sentence Summary

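This paper proposes a training-free method that caches the key information of a long text in chunks and, during generation, computes over only the most relevant chunks. This substantially accelerates diffusion language models on long texts (up to a 28x speedup) while matching or even exceeding the output quality of the original model.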
