Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective
1️⃣ One-sentence summary
This paper presents Helium, a new serving framework that optimizes complex AI agentic workflows by treating them as database query plans. Using techniques such as proactive caching and cache-aware scheduling, it significantly improves execution efficiency, running up to 1.56x faster than existing systems.
Agentic workflows are composed of sequences of interdependent Large Language Model (LLM) calls, and they have become a dominant workload in modern AI systems. These workflows exhibit extensive redundancy from overlapping prompts and intermediate results due to speculative and parallel exploration. Existing LLM serving systems, such as vLLM, focus on optimizing individual inference calls and overlook cross-call dependencies, leading to significant inefficiencies. This paper rethinks LLM and agent serving from a data systems perspective and introduces Helium, a workflow-aware serving framework that models agentic workloads as query plans and treats LLM invocations as first-class operators. Helium integrates proactive caching and cache-aware scheduling to maximize reuse across prompts, KV states, and workflows. Through these techniques, Helium bridges classic query optimization principles with LLM serving, achieving up to 1.56x speedup over state-of-the-art agent serving systems on various workloads. Our results demonstrate that end-to-end optimization across workflows is essential for scalable and efficient LLM-based agents.
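The core idea, modeling a workflow as a query plan of LLM-call operators and scheduling calls so that those sharing prompt prefixes run together, can be illustrated with a toy sketch. This is not Helium's actual API: all names (`LLMCall`, `PrefixCache`, `schedule_cache_aware`) are hypothetical, and word-level prefixes stand in for cached KV blocks.

```python
from dataclasses import dataclass, field

@dataclass
class LLMCall:
    """One LLM invocation treated as a query-plan operator."""
    call_id: str
    prompt: str                               # shared prefixes enable KV reuse
    deps: list = field(default_factory=list)  # upstream call_ids

class PrefixCache:
    """Toy prefix cache: counts prompt tokens served from cache vs. recomputed."""
    def __init__(self):
        self.cached_prefixes = set()
        self.hits = 0
        self.misses = 0

    def lookup(self, prompt):
        # Longest cached prefix wins (word granularity stands in for KV blocks).
        words = prompt.split()
        for n in range(len(words), 0, -1):
            if " ".join(words[:n]) in self.cached_prefixes:
                self.hits += n
                self.misses += len(words) - n
                break
        else:
            self.misses += len(words)
        # "Proactive caching": register every prefix for later calls to reuse.
        for n in range(1, len(words) + 1):
            self.cached_prefixes.add(" ".join(words[:n]))

def schedule_cache_aware(calls):
    """Topologically order calls; among ready calls, cluster shared prefixes."""
    done, order, pending = set(), [], list(calls)
    while pending:
        ready = [c for c in pending if all(d in done for d in c.deps)]
        ready.sort(key=lambda c: c.prompt)  # sorting groups common prefixes
        nxt = ready[0]
        pending.remove(nxt)
        done.add(nxt.call_id)
        order.append(nxt)
    return order

# Three calls of one workflow sharing a long prompt prefix (system + task).
calls = [
    LLMCall("plan",   "system: agent. task: book a trip. step: plan"),
    LLMCall("search", "system: agent. task: book a trip. step: search", deps=["plan"]),
    LLMCall("verify", "system: agent. task: book a trip. step: verify", deps=["plan"]),
]
cache = PrefixCache()
for call in schedule_cache_aware(calls):
    cache.lookup(call.prompt)
print(cache.hits, cache.misses)  # the two follow-up calls reuse the 7-word prefix
```

Under this toy accounting, the first call misses on all 8 words, while the two dependent calls each hit on the 7 shared words and recompute only their final word, the kind of cross-call reuse that per-request serving systems cannot exploit.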
Source: arXiv: 2603.16104