arXiv submission date: 2026-01-12
📄 Abstract - OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.

Top-level tags: agents multi-modal systems
Detailed tags: computer-using agents, vision-language models, long-horizon workflows, visual context curation, tool agents

OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent


1️⃣ One-sentence summary

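This paper proposes OS-Symphony, a new agent framework built on two innovations: milestone-driven long-term memory and visual-aware tutorial retrieval. Together they address two weaknesses of existing computer-using agents, namely their tendency to make errors in complex long-horizon tasks and their poor adaptation to unfamiliar scenarios, and thereby significantly improve the agent's robustness and generality.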

Source: arXiv: 2601.07779