arXiv submission date: 2026-02-05
📄 Abstract - SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration

Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. Recent approaches usually require auxiliary training or specialization, and even training-free methods incur costly search or optimization. We propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of a given LLM. Using layer sensitivity as a proxy for output perturbation, SDFP removes low-impact layers to obtain a compact draft while preserving compatibility with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained drafts, enabling rapid, deployment-friendly draft construction. Across benchmarks, SDFP delivers 1.32x-1.5x decoding speedup without altering the target model's output distribution, supporting low-latency multimedia applications.
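As a rough illustration of the draft-construction step, here is a minimal sketch of FIT-based layer pruning. It assumes the empirical Fisher approximation (the trace estimated from squared log-likelihood gradients over a small calibration set) and a LLaMA-style `model.model.layers` layout; the function names, the calibration loop, and the `num_prune` parameter are illustrative assumptions, not the paper's exact procedure.

```python
import copy

import torch

def fit_scores(model, calib_batches, device="cuda"):
    """Estimate a per-layer Fisher Information Trace (FIT) score as the
    sum of squared log-likelihood gradients over a calibration set.
    Low-scoring layers perturb the output least: prune candidates."""
    model.to(device).eval()
    scores = torch.zeros(len(model.model.layers))
    for input_ids in calib_batches:
        input_ids = input_ids.to(device)
        model.zero_grad()
        # Causal-LM loss = average negative log-likelihood of next tokens.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        for i, layer in enumerate(model.model.layers):
            scores[i] += sum(
                p.grad.detach().pow(2).sum().item()
                for p in layer.parameters()
                if p.grad is not None
            )
    return scores / len(calib_batches)

def build_draft_by_pruning(model, scores, num_prune):
    """Drop the num_prune lowest-FIT layers to form a compact draft that
    shares the target's tokenizer and embeddings, so standard speculative
    verification applies unchanged."""
    keep = sorted(scores.argsort()[num_prune:].tolist())
    draft = copy.deepcopy(model)
    draft.model.layers = torch.nn.ModuleList(
        [draft.model.layers[i] for i in keep]
    )
    draft.config.num_hidden_layers = len(keep)
    return draft
```

A hypothetical usage would be `draft = build_draft_by_pruning(model, fit_scores(model, batches), num_prune=8)`; note this involves gradient computation but no parameter updates, which is consistent with the "training-free" claim.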

Top tags: llm model training systems
Detailed tags: speculative decoding, model pruning, inference acceleration, fisher information, training-free optimization

SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration


1️⃣ One-Sentence Summary

This paper proposes SDFP, a training-free, plug-and-play framework that quickly builds a lightweight draft model by pruning the unimportant layers of a large language model, speeding up text generation by 1.32x-1.5x without changing the original model's output quality and thereby reducing latency for multimedia applications.
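To make the "without changing the original model's output" claim concrete, below is a minimal sketch of the verification loop that a FIT-pruned draft would plug into. It shows the simplest greedy variant (accept draft tokens until the first disagreement with the target, then take the target's own token), run without a KV cache for brevity; the full method would use standard speculative-sampling rejection to preserve the sampling distribution. `speculative_decode` and its parameters are illustrative, not the paper's API.

```python
import torch

@torch.no_grad()
def speculative_decode(target, draft, input_ids, max_new=128, k=4):
    """Greedy speculative decoding: the pruned draft proposes k tokens,
    the target verifies them all in one forward pass, and the longest
    agreeing prefix is accepted (plus one target token per round)."""
    ids = input_ids
    while ids.shape[1] - input_ids.shape[1] < max_new:
        # 1) Draft proposes k tokens autoregressively (cheap: fewer layers).
        prop = ids
        for _ in range(k):
            next_tok = draft(prop).logits[:, -1].argmax(-1, keepdim=True)
            prop = torch.cat([prop, next_tok], dim=1)
        # 2) Target scores the whole proposal in a single forward pass.
        #    tgt[:, p] is the target's greedy choice after prefix prop[:, :p+1].
        tgt = target(prop).logits.argmax(-1)
        # 3) Accept draft tokens until the first disagreement with the target.
        n_ctx, accepted = ids.shape[1], 0
        for j in range(k):
            if prop[0, n_ctx + j] == tgt[0, n_ctx + j - 1]:
                accepted += 1
            else:
                break
        ids = prop[:, : n_ctx + accepted]
        # 4) Append the target's own token at the first mismatch (or a bonus
        #    token if all k were accepted), guaranteeing progress every round.
        ids = torch.cat(
            [ids, tgt[:, n_ctx + accepted - 1 : n_ctx + accepted]], dim=1
        )
    return ids[:, : input_ids.shape[1] + max_new]
```

Because every emitted token is either confirmed or produced by the target, the output matches running the target alone; the speedup comes from the draft's cheaper per-token forward passes.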

Source: arXiv:2602.05499