arXiv submission date: 2026-03-05
📄 Abstract - SLO-Aware Compute Resource Allocation for Prefill-Decode Disaggregated LLM Inference

Prefill-Decode (P/D) disaggregation has emerged as a widely adopted optimization strategy for Large Language Model (LLM) inference. However, there is currently no well-established methodology for determining the optimal number of P/D hardware resources subject to constraints on total throughput, service level objectives (SLOs), and request characteristics, specifically input and output lengths. To address this gap, we propose a hybrid approach that combines theoretical modeling with empirical benchmarking. First, we present a theoretical model for calculating P/D resource counts based on total throughput requirements, request input and output lengths, and prefill and decode throughput. Then, to obtain the actual prefill and decode throughput under SLO constraints, we model the prefill process using M/M/1 queuing theory, deriving the achieved prefill throughput from the benchmarked maximum prefill throughput and the Time-To-First-Token (TTFT) constraint. For the decode phase, we determine the decode batch sizes that meet Time-Per-Output-Token (TPOT) requirements and obtain the corresponding decode throughput through empirical measurements. Our experimental results demonstrate that the proposed method can accurately predict optimal P/D resource allocation in real-world LLM inference scenarios.
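The two modeling steps in the abstract can be sketched in code. The sketch below is an illustrative reading, not the paper's exact model: it assumes resource counts are obtained by dividing aggregate token demand by per-instance throughput, and it uses the standard M/M/1 sojourn-time formula W = 1/(μ − λ) to back out the arrival rate a prefill instance can sustain under a TTFT bound. All function names and parameters are hypothetical.

```python
import math

def mm1_achieved_rate(mu_max: float, ttft_slo: float) -> float:
    """M/M/1 queue: mean sojourn time W = 1 / (mu - lambda).
    Requiring W <= ttft_slo and solving for the arrival rate gives
    lambda <= mu - 1 / ttft_slo, i.e. the prefill request rate a
    single instance can sustain under the TTFT constraint."""
    return max(mu_max - 1.0 / ttft_slo, 0.0)

def pd_instance_counts(total_rps: float, in_len: int, out_len: int,
                       prefill_tps: float, decode_tps: float) -> tuple[int, int]:
    """Instances needed so that aggregate per-instance token throughput
    (tokens/s, measured under the SLO constraints) covers token demand:
    prefill must process total_rps * in_len input tokens per second,
    decode must emit total_rps * out_len output tokens per second."""
    n_prefill = math.ceil(total_rps * in_len / prefill_tps)
    n_decode = math.ceil(total_rps * out_len / decode_tps)
    return n_prefill, n_decode

# Example: 100 req/s, 2000-token prompts, 256-token outputs,
# 50k prefill tokens/s and 8k decode tokens/s per instance.
print(pd_instance_counts(100, 2000, 256, 50_000, 8_000))  # (4, 4)
```

In this reading, the paper's contribution is supplying the `prefill_tps` and `decode_tps` inputs realistically: prefill throughput comes from the M/M/1-derived sustainable rate rather than the raw benchmark maximum, and decode throughput is measured at the largest batch size that still meets the TPOT target.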

Top-level tags: llm systems model evaluation
Detailed tags: resource allocation inference optimization prefill-decode disaggregation slo queuing theory

SLO-Aware Compute Resource Allocation for Prefill-Decode Disaggregated LLM Inference


1️⃣ One-sentence summary

This paper proposes a method combining theoretical modeling with empirical measurement to precisely compute how much compute each of the two key stages, prefill and decode, requires under given service-quality constraints and request characteristics (such as input and output lengths), enabling efficient deployment of LLM inference services.

Source: arXiv:2603.04716