arXiv submission date: 2025-12-18
📄 Abstract - LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding

Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1--3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). We then introduce Lookahead PArallel Decoding (LoPA), a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on the GSM8K benchmark while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at this https URL.
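
To make the branch-and-select idea concrete, here is a minimal PyTorch sketch of one lookahead decoding step. Everything in it is an assumption for illustration, not the paper's implementation: the function names are invented, `model` is any callable mapping token ids to per-position logits, and branch confidence is approximated as the mean top-1 probability over still-masked positions.

```python
import torch

def branch_confidence(logits, masked):
    # Branch score: mean top-1 softmax probability over positions that
    # are still masked (a simple proxy for "future parallelism").
    probs = logits.softmax(dim=-1)                      # (L, V)
    return probs.max(dim=-1).values[masked].mean()

def lopa_step(model, tokens, masked, num_branches=4):
    # 1) Score every masked position with one forward pass.
    logits = model(tokens.unsqueeze(0))[0]              # (L, V)
    conf, pred = logits.softmax(dim=-1).max(dim=-1)     # (L,), (L,)
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    cand = conf.topk(num_branches).indices              # positions to try first

    # 2) Spawn branches, each committing a different candidate TFO
    #    by unmasking a different high-confidence position first.
    branches = tokens.repeat(num_branches, 1)           # (B, L)
    branch_masks = masked.repeat(num_branches, 1)
    for b, pos in enumerate(cand):
        branches[b, pos] = pred[pos]
        branch_masks[b, pos] = False

    # 3) One batched lookahead pass; keep the most promising branch.
    branch_logits = model(branches)                     # (B, L, V)
    scores = torch.stack([branch_confidence(branch_logits[b], branch_masks[b])
                          for b in range(num_branches)])
    best = scores.argmax()
    return branches[best], branch_masks[best]

# Toy usage with a random stand-in for a dLLM (shapes only):
V, L = 100, 16
model = lambda x: torch.randn(x.shape[0], L, V)
tokens = torch.zeros(L, dtype=torch.long)
masked = torch.ones(L, dtype=torch.bool)
tokens, masked = lopa_step(model, tokens, masked)
```

Note that step 3 batches all branches into a single forward pass, so exploring several TFOs costs roughly one extra model call per step rather than one per branch.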

Top-level tags: llm model training systems
Detailed tags: parallel decoding inference acceleration diffusion llm token filling order branch parallelism

LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding


1️⃣ One-sentence summary

This paper proposes LoPA, a training-free, plug-and-play algorithm that explores distinct candidate token filling orders in parallel and selects the path with the highest potential for future parallelism, raising the number of tokens a diffusion large language model generates per forward pass to more than 10 and thereby substantially accelerating inference.
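
The abstract also mentions a multi-device system built on Branch Parallelism (BP), which evaluates candidate branches on separate GPUs. The sketch below shows that idea only in the most schematic form, assuming one model replica per device (`replicas[i]` lives on `cuda:i`, a hypothetical setup); the paper's actual system and its reported 1073.9 tokens/s throughput involve considerably more engineering.

```python
import torch

def branch_parallel_scores(replicas, branches, masks):
    # Launch one forward pass per branch on its own device. CUDA
    # kernel launches are asynchronous, so these passes overlap
    # across GPUs instead of running back to back.
    logits = []
    for i, (model, branch) in enumerate(zip(replicas, branches)):
        dev = torch.device(f"cuda:{i}")
        logits.append(model(branch.to(dev).unsqueeze(0))[0])
    # Score each branch by mean top-1 confidence over masked slots;
    # .item() synchronizes, so scoring happens after all launches.
    scores = []
    for out, mask in zip(logits, masks):
        conf = out.softmax(dim=-1).max(dim=-1).values
        scores.append(conf[mask.to(out.device)].mean().item())
    return scores  # the branch with the highest score wins
```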

Source: arXiv: 2512.16229