arXiv submission date: 2026-05-28
📄 Abstract - Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully conditioned sequences, committing to a single greedy path often fails to capture the target model's preferred trajectory. To address this, we propose BASTION, a budget-aware speculative decoding framework with tree-based diffusion drafting. Unlike existing methods that rely on static tree topologies, BASTION dynamically constructs query-dependent trees by balancing draft quality against hardware constraints. Our framework integrates three synergistic components: (1) an acceptance surrogate that estimates expected accepted length via path confidence, (2) an online latency estimator that calibrates a hardware-aware roofline model, and (3) an adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs. BASTION is training-free, preserves the target model's distribution, and requires no per-setting tuning. Across diverse benchmarks and GPU architectures, BASTION achieves up to a 6.61x speedup over standard autoregressive decoding, outperforming state-of-the-art block-diffusion baselines by 39%.
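The three components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it rests on several assumptions that the abstract does not confirm: path confidence is taken as the product of position-wise drafter marginals, the expected accepted length is approximated as one guaranteed token plus the sum of path confidences over tree nodes, the roofline model is a simple `max(memory_bound, compute_per_token * n)` with fixed (not online-calibrated) parameters, and expansion stops as soon as the best remaining candidate would lower expected tokens per millisecond. The names `RooflineModel` and `build_draft_tree` are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class RooflineModel:
    """Hypothetical two-parameter roofline for one verification pass."""
    memory_bound_ms: float       # floor: cost of streaming the weights once
    compute_ms_per_token: float  # slope once the pass becomes compute-bound

    def verify_latency(self, num_tokens: int) -> float:
        return max(self.memory_bound_ms, self.compute_ms_per_token * num_tokens)


def build_draft_tree(marginals, latency, max_nodes=64):
    """Grow a draft tree best-first under a latency-aware stopping rule.

    marginals[d] maps token -> drafter probability at draft position d.
    Returns (paths, expected_len), where each path is a tuple of tokens
    from the root to one selected tree node.
    """
    # Max-heap on path confidence (negated, since heapq is a min-heap).
    heap = [(-p, (tok,)) for tok, p in marginals[0].items()]
    heapq.heapify(heap)

    chosen = []
    expected_len = 1.0  # assumed: verification always yields one token

    while heap and len(chosen) < max_nodes:
        neg_conf, path = heapq.heappop(heap)
        conf = -neg_conf
        n = len(chosen)

        # Expected tokens per millisecond before and after adding the node.
        rate_now = expected_len / latency.verify_latency(n)
        rate_new = (expected_len + conf) / latency.verify_latency(n + 1)
        if rate_new <= rate_now:
            break  # marginal gain no longer pays for the extra verified token

        chosen.append(path)
        expected_len += conf

        # Children reuse the position-wise marginal of the next depth, so a
        # child's confidence never exceeds its parent's and the selected set
        # always forms a valid tree.
        depth = len(path)
        if depth < len(marginals):
            for tok, p in marginals[depth].items():
                heapq.heappush(heap, (-(conf * p), path + (tok,)))

    return chosen, expected_len


if __name__ == "__main__":
    demo_marginals = [{"a": 0.9, "b": 0.1}, {"c": 0.8, "d": 0.2}]
    demo_paths, demo_len = build_draft_tree(demo_marginals, RooflineModel(10.0, 2.5))
    print(demo_paths, demo_len)
```

With these toy numbers the verification pass stays memory-bound up to four tokens, so the tree grows for free until that point and stops once a fifth node would push the pass into the compute-bound regime. A real system would replace the fixed roofline parameters with measured ones and use the target model's actual acceptance rule.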

Top-level tags: machine learning llm
Detailed tags: speculative decoding block diffusion tree-structured drafting budget-aware latency estimation

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting


1️⃣ One-sentence summary

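This paper proposes BASTION, a new speculative decoding technique that dynamically builds tree-structured diffusion drafts and weighs draft quality against the hardware compute budget, speeding up large language model text generation by up to 6.6x without any retraining while keeping output quality unchanged.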

Source: arXiv: 2605.29727