arXiv submission date: 2026-03-03
📄 Abstract - SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: since cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, enabling a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventional disaggregation while keeping time-per-output-token (TPOT) within 5%. SUN inherently enables and facilitates low-bit decoding; with Quantized SUN (QSUN), it achieves a 45% speedup with comparable accuracy to SUN while preserving the benefits of shared decoding.
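Because the frozen decode module is shared, any decode worker can serve any model's request, so the model-agnostic routing policy the abstract describes reduces to load balancing across interchangeable workers. The sketch below illustrates one such policy (least-loaded dispatch via a min-heap); the class and names are illustrative assumptions, not the paper's implementation.

```python
import heapq

class DecodeRouter:
    """Hypothetical model-agnostic decode router: since decode workers
    share one frozen decode module, requests from any model can go to
    whichever worker currently has the fewest active requests."""

    def __init__(self, num_workers):
        # Min-heap of (active_request_count, worker_id) pairs.
        self.heap = [(0, wid) for wid in range(num_workers)]
        heapq.heapify(self.heap)

    def route(self, request_id):
        load, wid = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (load + 1, wid))
        return wid  # the shared decode worker chosen for this request

router = DecodeRouter(num_workers=2)
assignments = [router.route(r) for r in range(4)]
# Requests spread evenly across the two shared workers: [0, 1, 0, 1]
```

A real router would also decrement a worker's load when a request finishes and account for KV-cache memory, but the key point stands: no per-model partitioning constrains the placement decision.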

Top-level tags: systems llm model training
Detailed tags: multi-model serving, resource efficiency, model disaggregation, inference optimization, quantization

SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving


1️⃣ One-sentence summary

This paper proposes a method called SUN that freezes the decode portion of a large language model and shares it across multiple models, significantly improving GPU utilization and system throughput when serving many models concurrently, while preserving model accuracy.
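The core idea, fine-tune only the task-specific prefill module while keeping the decode module frozen and shared, can be illustrated with a minimal sketch. The dict-of-parameters representation and the `prefill.`/`decode.` naming are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of SUN's training split, assuming a model is
# represented as named parameter groups. Only prefill parameters are
# fine-tuned per task; decode parameters stay frozen so a single copy
# can be shared by every model.

def trainable_params(model_params):
    """Return only the parameters SUN would fine-tune (the prefill module)."""
    return {name: p for name, p in model_params.items()
            if name.startswith("prefill.")}

# One frozen decode module, shared by reference across two task models.
shared_decode = {"decode.layer0.w": [0.1], "decode.layer1.w": [0.2]}

model_a = {"prefill.layer0.w": [0.3], **shared_decode}
model_b = {"prefill.layer0.w": [0.4], **shared_decode}

# Fine-tuning touches only the prefill weights; the decode weights are
# literally the same objects in both models, so one GPU-resident copy
# can serve decode requests from either.
assert set(trainable_params(model_a)) == {"prefill.layer0.w"}
assert all(model_a[k] is model_b[k] for k in shared_decode)
```

In a real framework this would correspond to marking decode-module parameters as non-trainable (e.g. disabling their gradients) and loading the shared decode weights once per decode worker.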

Source: arXiv 2603.02599