Abstract - PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO attainment, demonstrating the importance of phase-aware scheduling for reasoning-based LLM deployment.
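The core idea of prioritizing reasoning-phase requests to cut TTFT can be illustrated with a toy scheduler. This is not the paper's implementation; the `Request` class, phase constants, and `schedule` function below are hypothetical, sketching only the priority ordering (reasoning before answering, FCFS within a phase):

```python
import heapq

# Hypothetical phase labels: reasoning gets the lower (higher-priority) key,
# so reasoning-phase requests are dispatched first and TTFT shrinks.
REASONING, ANSWERING = 0, 1

class Request:
    """Toy request record: id, current phase, and arrival timestamp."""
    def __init__(self, rid, phase, arrival):
        self.rid = rid
        self.phase = phase
        self.arrival = arrival

def schedule(requests):
    """Return a dispatch order: reasoning-phase requests first,
    ties broken by arrival time (FCFS within each phase)."""
    heap = [(r.phase, r.arrival, r.rid) for r in requests]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, rid = heapq.heappop(heap)
        order.append(rid)
    return order
```

For example, with two answering-phase and two reasoning-phase requests interleaved by arrival, the reasoning requests jump the queue while answering requests keep their relative order. The paper's actual scheduler is hierarchical (instance-level placement plus intra-instance execution) and adds preemption, token pacing, and phase-boundary migration on top of this basic priority.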
PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
1️⃣ One-Sentence Summary
This paper proposes PASCAL, a phase-aware scheduling algorithm that identifies and prioritizes the "reasoning" phase of large language model inference to significantly speed up generation of the first visible output token, while carefully managing resource allocation during the "answering" phase, thereby substantially improving the responsiveness of reasoning-based AI services without sacrificing answer quality.