Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
1️⃣ One-Sentence Summary
This paper proposes two efficient algorithms that, under strict latency, accuracy, and budget constraints, quickly and automatically select and configure heterogeneous GPU resources for large language model inference serving, substantially reducing compute cost while preserving service quality.
Deploying large language model (LLM) inference at scale requires jointly selecting base models, provisioning heterogeneous GPUs, configuring parallelism, and distributing workloads under tight latency, accuracy, and budget constraints. Exact mixed-integer linear programming (MILP) approaches guarantee optimality but scale poorly. We propose two constraint-aware heuristics: a Greedy Heuristic (GH) for single-pass allocation, and an Adaptive Greedy Heuristic (AGH) that enhances GH via multi-start construction, relocate-based local search, and GPU consolidation. Three constraint-aware mechanisms -- TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade -- ensure feasibility under tightly coupled memory, delay, error, and budget constraints. On workloads calibrated with the Azure LLM Inference Trace (2025), both heuristics produce feasible solutions in under one second, with AGH closely approaching optimal cost while achieving over 260x speedup on large-scale instances. Under out-of-sample stress tests with up to 1.5x parameter inflation, AGH maintains controlled SLO violations and stable cost, whereas the exact solver's placement degrades sharply.
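To make the abstract's Greedy Heuristic (GH) concrete, below is a minimal sketch of its core loop: TP-aware feasibility filtering followed by cost-per-effective-coverage ranking. All names, data structures, and numbers are illustrative assumptions, not the authors' implementation, and the sketch omits the delay/error constraints, TP upgrade, and AGH's multi-start and local-search refinements.

```python
# Hypothetical sketch of the paper's Greedy Heuristic (GH).
# Structures and parameters are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class GpuConfig:
    name: str
    tp_degree: int       # tensor-parallel degree of this configuration
    mem_gb: float        # aggregate memory across the TP group
    throughput: float    # requests/sec servable within the delay SLO
    cost: float          # cost per hour for the whole TP group

def greedy_allocate(configs, model_mem_gb, demand_rps, budget):
    """Single-pass allocation: keep only TP-feasible configs, rank them by
    cost per unit of effective coverage, then add instances until demand is
    covered. Returns (allocation, total_cost), or None if infeasible."""
    # TP-aware feasibility: the TP group's memory must hold the model.
    feasible = [c for c in configs if c.mem_gb >= model_mem_gb]
    if not feasible:
        return None
    # Cost-per-effective-coverage ranking (lower is better).
    feasible.sort(key=lambda c: c.cost / c.throughput)
    allocation, covered, cost = {}, 0.0, 0.0
    for c in feasible:
        while covered < demand_rps and cost + c.cost <= budget:
            allocation[c.name] = allocation.get(c.name, 0) + 1
            covered += c.throughput
            cost += c.cost
        if covered >= demand_rps:
            return allocation, cost
    return None  # budget exhausted before demand was met
```

For example, with a 140 GB model, the single-GPU config below is infeasible on memory, so the heuristic covers a 300 req/s demand with three TP-2 groups:

```python
configs = [
    GpuConfig("A100x2-TP2", 2, 160, 120.0, 8.0),
    GpuConfig("H100x1-TP1", 1, 80, 150.0, 12.0),
]
greedy_allocate(configs, model_mem_gb=140, demand_rps=300, budget=40)
# -> ({"A100x2-TP2": 3}, 24.0)
```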
Source: arXiv: 2604.07472