Abstract - PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO attainment, demonstrating the importance of phase-aware scheduling for reasoning-based LLM deployment.
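The core idea of prioritizing reasoning-phase requests to cut TTFT can be illustrated with a toy scheduler. This is not the paper's implementation; the `Request` class, phase constants, and `schedule` function below are hypothetical, sketching only the priority ordering (reasoning before answering, FCFS within a phase):

```python
import heapq

# Hypothetical phase labels: reasoning gets the lower (higher-priority) key,
# so reasoning-phase requests are dispatched first and TTFT shrinks.
REASONING, ANSWERING = 0, 1

class Request:
    """Toy request record: id, current phase, and arrival timestamp."""
    def __init__(self, rid, phase, arrival):
        self.rid = rid
        self.phase = phase
        self.arrival = arrival

def schedule(requests):
    """Return a dispatch order: reasoning-phase requests first,
    ties broken by arrival time (FCFS within each phase)."""
    heap = [(r.phase, r.arrival, r.rid) for r in requests]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, rid = heapq.heappop(heap)
        order.append(rid)
    return order
```

For example, with two answering-phase and two reasoning-phase requests interleaved by arrival, the reasoning requests jump the queue while answering requests keep their relative order. The paper's actual scheduler is hierarchical (instance-level placement plus intra-instance execution) and adds preemption, token pacing, and phase-boundary migration on top of this basic priority.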
PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
1️⃣ One-Sentence Summary
This paper proposes PASCAL, a phase-aware scheduling algorithm that identifies and prioritizes the "reasoning" phase of large language model inference to significantly speed up generation of the first visible output token, while carefully managing resource allocation during the "answering" phase, thereby substantially improving the responsiveness of reasoning-based AI services without sacrificing answer quality.