arXiv submission date: 2026-03-23
📄 Abstract - Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs

Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of the context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity introduces new challenges in scheduling across models with diverse throughput and performance. We present Chimera, a predictive scheduling system for multi-agent workflow serving on heterogeneous LLM clusters that jointly improves end-to-end latency and task performance. Chimera applies semantic routing to estimate per-model confidence scores for each request, predicts the total remaining output length of the workflow, and estimates per-model congestion using in-flight predicted token volumes for load balancing. We evaluate Chimera on representative agentic workflows for code generation and math reasoning using multiple heterogeneous LLM configurations. Across comparable settings, Chimera traces the best latency-performance frontier, reducing end-to-end latency by 1.2--2.4$\times$ and improving task performance by 8.0--9.5 percentage points on average over competitive baselines including vLLM.
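The abstract describes a scheduler that combines three predicted signals: a per-model confidence score from semantic routing, the predicted remaining output length of the workflow, and per-model congestion estimated from in-flight predicted token volumes. A minimal sketch of how such a policy could combine these signals is below; all names, the scoring formula, and the `alpha` trade-off knob are assumptions for illustration, not the paper's actual design.

```python
# Illustrative sketch of confidence-aware, congestion-aware routing.
# The scoring rule (confidence minus a congestion penalty) is an assumption.
from dataclasses import dataclass

@dataclass
class ModelReplica:
    name: str
    throughput_tok_s: float            # measured decode throughput
    inflight_pred_tokens: float = 0.0  # sum of predicted remaining tokens already routed here

def route(request_conf: dict[str, float], replicas: list[ModelReplica],
          pred_remaining_tokens: float, alpha: float = 0.5) -> ModelReplica:
    """Pick the replica maximizing confidence minus an estimated-delay penalty.

    request_conf: per-model semantic-routing confidence in [0, 1].
    pred_remaining_tokens: predicted total remaining output of the workflow.
    alpha: hypothetical latency/performance trade-off knob.
    """
    def score(r: ModelReplica) -> float:
        # Rough queueing-delay estimate if this request joins the replica's backlog.
        congestion_s = (r.inflight_pred_tokens + pred_remaining_tokens) / r.throughput_tok_s
        return request_conf[r.name] - alpha * congestion_s

    best = max(replicas, key=score)
    best.inflight_pred_tokens += pred_remaining_tokens  # account for the predicted load
    return best

replicas = [ModelReplica("small-7b", throughput_tok_s=400.0),
            ModelReplica("large-70b", throughput_tok_s=80.0)]
conf = {"small-7b": 0.55, "large-70b": 0.90}
chosen = route(conf, replicas, pred_remaining_tokens=200.0)
```

In this toy run the slower 70B model's higher confidence is outweighed by its congestion penalty, so the request goes to the faster 7B replica; raising `alpha` shifts the frontier further toward latency, lowering it toward task performance.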

Top tags: llm systems agents
Detailed tags: serving systems, scheduling, heterogeneous models, latency optimization, multi-agent workflows

Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs


1️⃣ One-sentence summary

This paper presents Chimera, a predictive scheduling system that lets a heterogeneous cluster of LLMs of different sizes and capabilities serve complex multi-agent tasks cooperatively, achieving lower latency and higher task success rates at the same time.

Source: arXiv:2603.22206