Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought

📄 Abstract

As modern microservice systems grow increasingly complex due to dynamic interactions and evolving runtime environments, they experience failures with rising frequency. Ensuring system reliability therefore critically depends on accurate root cause localization (RCL). While numerous traditional machine learning and deep learning approaches have been explored for this task, they often suffer from limited interpretability and poor transferability across deployments. More recently, large language model (LLM)-based methods have been proposed to address these issues. However, existing LLM-based approaches still face two fundamental limitations: context explosion, which dilutes critical evidence and degrades localization accuracy, and serial reasoning structures, which hinder deep causal exploration and impair inference efficiency. In this paper, we conduct a comprehensive study of both how human SREs perform root cause localization in practice and why existing LLM-based methods fall short. Motivated by these findings, we introduce RCLAgent, an in-depth root cause localization framework for microservice systems that realizes multi-agent recursion-of-thought with parallel reasoning. RCLAgent decomposes the diagnostic process along the trace graph by assigning each span to a Dedicated Agent and organizing agents recursively and in parallel according to the graph topology, with the final diagnosis obtained by synthesizing the Root-Level Diagnosis Report and the Global Evidence Graph. Extensive experiments on multiple public benchmarks demonstrate that RCLAgent consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.
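The decomposition described in the abstract (one agent per span, organized recursively and in parallel along the trace graph) can be illustrated with a minimal sketch. This is not RCLAgent's implementation. The `Span` and `Report` types, the `analyze_local` heuristic standing in for an LLM call, the 100 ms threshold, and the flat `evidence` dictionary standing in for the Global Evidence Graph are all assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Span:
    """One node of a trace tree (hypothetical, simplified schema)."""
    span_id: str
    service: str
    duration_ms: float
    error: bool = False
    children: list["Span"] = field(default_factory=list)


@dataclass
class Report:
    """Diagnosis report produced by the agent assigned to one span."""
    span_id: str
    service: str
    suspicious: bool
    child_reports: list["Report"] = field(default_factory=list)


async def analyze_local(span: Span) -> bool:
    """Stand-in for an LLM call that inspects only this span's own evidence."""
    await asyncio.sleep(0)  # placeholder for model latency
    self_time = span.duration_ms - sum(c.duration_ms for c in span.children)
    return span.error or self_time > 100.0  # illustrative threshold, not from the paper


async def diagnose(span: Span, evidence: dict[str, bool]) -> Report:
    """Recursive agent: child spans are diagnosed concurrently, then this span."""
    child_reports = await asyncio.gather(
        *(diagnose(child, evidence) for child in span.children)
    )
    suspicious = await analyze_local(span)
    evidence[span.span_id] = suspicious  # shared record of per-span findings
    return Report(span.span_id, span.service, suspicious, list(child_reports))


def deepest_suspects(report: Report) -> list[str]:
    """Services flagged suspicious that have no suspicious descendant."""
    found = [s for child in report.child_reports for s in deepest_suspects(child)]
    if found:
        return found
    return [report.service] if report.suspicious else []


def build_example_trace() -> Span:
    """A made-up trace: frontend calls cart (which calls db) and auth."""
    db = Span("s3", "db", 400.0, error=True)
    cart = Span("s2", "cart", 450.0, error=True, children=[db])
    auth = Span("s4", "auth", 20.0)
    return Span("s1", "frontend", 500.0, error=True, children=[cart, auth])


def run_example() -> tuple[list[str], dict[str, bool]]:
    evidence: dict[str, bool] = {}
    root_report = asyncio.run(diagnose(build_example_trace(), evidence))
    return deepest_suspects(root_report), evidence
```

Because each call to `diagnose` receives only its own span and the reports of its children, no single call has to hold the whole trace in context, and sibling subtrees are explored concurrently through `asyncio.gather`. The final step here combines the root-level report with the shared evidence in a deliberately simple way; the paper's synthesis step is not specified in the abstract.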


1️⃣ One-Sentence Summary

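This paper proposes RCLAgent, a root cause localization framework for microservice failures. By having multiple agents analyze each step of the system's call chain recursively and in parallel, it addresses the context explosion and inefficient serial reasoning of existing LLM-based methods and significantly improves both localization accuracy and efficiency.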
