arXiv submission date: 2026-05-14
📄 Abstract - PreFT: Prefill-only finetuning for efficient inference

Large language models can now be personalised efficiently at scale using parameter-efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory-management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. For efficient multi-adapter serving, rather than optimising performance relative to parameter count, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we apply the adapter only to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than serving traditional PEFTs ($1.9\times$ the throughput when serving $512$ adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than that of PEFTs, but the gap can be compensated for by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
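To make the core idea concrete, here is a minimal PyTorch-style sketch of a linear layer whose LoRA delta is applied only during prefill. This is an illustration under stated assumptions, not the paper's released vLLM implementation: the class name `PrefillOnlyLoRALinear` and the `is_prefill` flag are hypothetical, and a real serving engine would batch adapters with specialised kernels rather than branch per layer.

```python
# A minimal sketch of prefill-only LoRA, assuming a PyTorch-style module.
# Names (PrefillOnlyLoRALinear, is_prefill) are illustrative only; the
# paper's actual implementation is built into the vLLM inference engine.
import torch
import torch.nn as nn


class PrefillOnlyLoRALinear(nn.Module):
    """Linear layer whose LoRA delta is applied only during prefill.

    During decode (single-token autoregressive steps) the layer falls back
    to the frozen base weights, so multi-adapter serving pays the adapter
    cost only once per request, on the prompt tokens.
    """

    def __init__(self, base: nn.Linear, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = base  # frozen pretrained projection
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as a no-op delta
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor, is_prefill: bool) -> torch.Tensor:
        y = self.base(x)
        if is_prefill:
            # The adapter contributes only while the prompt is processed;
            # its effect on decode persists indirectly through the KV cache
            # written during prefill.
            y = y + self.lora_b(self.lora_a(x)) * self.scaling
        return y
```

In serving, `is_prefill` would be true for the prompt pass and false for every decode step, matching the abstract's description of applying the adapter to prefill tokens and discarding it afterwards.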

Top-level tags: llm, model training, model evaluation
Detailed tags: parameter-efficient finetuning, inference efficiency, prefill-only adaptation, multi-adapter serving, accuracy-throughput tradeoff

PreFT: Prefill-only finetuning for efficient inference


1️⃣ One-sentence summary

This paper proposes a finetuning method called PreFT that applies the adapter only while the model processes the input (the prefill stage) and discards it during generation, nearly doubling inference throughput when serving hundreds of personalised adapters simultaneously, with almost no impact on model performance.

Source: arXiv 2605.14217