📄
Abstract - Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models
Prompt length is a major bottleneck in agentic large language model (LLM) workloads, where repeated inference steps and multi-call loops incur substantial prefill cost. Recent work on speculative prefill demonstrates that attention-based token importance estimation can enable training-free prompt compression, but this assumes the existence of a draft model that shares the same tokenizer as the target model. In practice, however, agentic pipelines frequently employ models without any smaller in-family draft model. In this work, we study cross-family speculative prefill, where a lightweight draft model from one model family is used to perform prompt compression for a target model from a different family. Using the same speculative prefill mechanism as prior work, we evaluate a range of cross-family draft-target combinations, including Qwen, LLaMA, and DeepSeek models. Across a broad diversity of tasks, we find that attention-based token importance estimation transfers reliably across different model families despite differences in model architectures and tokenizers between draft and target models. Cross-model prompt compression largely retains 90~100% of full-prompt baseline performance and, in some cases, slightly improves accuracy due to denoising effects, while delivering substantial reductions in time to first token (TTFT). These results suggest that speculative prefill depends mainly on task priors and semantic structure, thus serving as a generalizable prompt compression primitive. We discuss the implications of our findings for agentic systems, where repeated long-context inference and heterogeneous model stacks make cross-model prompt compression both necessary and practical.
跨模型族推测式预填充:利用小型草稿模型实现无需训练的长上下文压缩 /
Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models
1️⃣ 一句话总结
这项研究发现,利用一个轻量级的小模型来压缩长文本提示,即使该小模型与最终使用的大模型来自不同技术家族、使用不同分词器,也能在保持90%以上准确率的同时,显著加快大模型的首次响应速度,从而为需要频繁处理长文本的AI代理系统提供了一种高效且通用的提速方案。