📄
Abstract - Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are a natural fit for these tasks, the deployment bottleneck is not reasoning capability but orchestration: selecting, for each operational event, the relevant data (metrics, logs, change events) and the applicable operational knowledge (handbook rules and practitioner experience). Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. We present Bian Que, an agentic framework with three contributions: (i) a \emph{unified operational paradigm} abstracting day-to-day O&M into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) \emph{Flexible Skill Arrangement}, where each Skill specifies which data and knowledge to retrieve for a given business-module context and can be automatically generated and updated by LLMs or iteratively refined through natural-language instructions from on-call engineers; (iii) a \emph{unified self-evolving mechanism} in which one correction signal drives two parallel pathways, case-memory-to-knowledge distillation and targeted Skill refinement. Deployed on the e-commerce search engine of KuaiShou, the major short-video platform in China, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, and cuts mean time to resolution by over 50%. Our framework achieves 99.0% pass rate on offline evaluations. Our code is available at this https URL.
Bian Que:一种支持灵活技能编排的在线系统运维智能体框架 /
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
1️⃣ 一句话总结
本文提出了一种名为Bian Que的智能体框架,通过将运维工作抽象为三种标准模式,并让大语言模型自动生成和更新每个操作场景所需的专属“技能”(即数据和知识的检索规则),从而有效解决了大型在线系统运维中信息过载和人工编排困难的问题,在快手电商搜索系统中减少了75%的告警,并显著缩短了故障修复时间。