Advancing Omnimodal Embodied Agents from Isolated Skills to Everyday Physical Autonomy
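1️⃣ One-sentence summary
This paper presents OmniAct, a framework that separates planning, memory, and verification into independent asynchronous modules. This lets robots carry out tasks autonomously over long periods in complex, unstructured environments and recover automatically from physical failures, markedly improving the efficiency and stability of multi-device coordination.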
Building persistent embodied agents in unstructured environments demands unified orchestration of heterogeneous tools spanning both cyber (APIs, IoT) and physical (manipulation, navigation) domains, coupled with autonomous recovery from physical failures that inevitably arise over extended operation. Existing systems treat these as separate problems: VLM-based planners lack a unified cyber-physical action space, agent frameworks accumulate unbounded context that degrades temporal coherence, and VLA policies execute open-loop without detecting their own failures. We argue that persistent autonomy requires not a monolithic model but a hierarchical asynchronous architecture with explicit separation of planning, memory, and verification. To this end, we present OmniAct, a framework integrating a multimodal semantic planner for skill routing across unified action spaces, an adaptive hierarchical memory with event-boundary-driven compression for sub-linear context growth, and an asynchronous visual preemption engine that closes the semantic loop during physical execution. Across 40 real-world long-horizon tasks on two robotic platforms coordinating four IoT devices, OmniAct achieves consistent improvements in end-to-end success across all complexity levels, maintains near-flat token consumption across 100k+ accumulated interaction tokens, and elevates mid-scale open-weight models to proprietary-level performance.
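To make the idea of event-boundary-driven compression concrete, here is a minimal Python sketch. It is not OmniAct's implementation: the class `EventMemory`, its methods, the rule that a change of event label marks a boundary, and the placeholder summarizer are all assumptions made for illustration. It only shows why the context handed to a planner can grow with the number of completed events rather than with the number of raw steps.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventMemory:
    """Hypothetical two-level memory: verbatim steps for the open event,
    one-line summaries for every closed event."""

    summaries: List[str] = field(default_factory=list)  # compressed history
    working: List[str] = field(default_factory=list)    # open event, verbatim
    current_event: Optional[str] = None

    def observe(self, event: str, step: str) -> None:
        # Assumption: an event boundary is simply a change of event label.
        if self.current_event is not None and event != self.current_event:
            self._compress()
        self.current_event = event
        self.working.append(step)

    def _compress(self) -> None:
        # Placeholder summarizer: keep the event name, step count, and last step.
        # A real system would presumably use a learned or model-based summary.
        summary = (
            f"{self.current_event}: {len(self.working)} steps, "
            f"last={self.working[-1]!r}"
        )
        self.summaries.append(summary)
        self.working.clear()

    def context(self) -> List[str]:
        # What a planner would read: compressed history plus the open event.
        return self.summaries + self.working
```

In this toy version, each closed event costs one line of context no matter how many steps it contained, so only the currently open event is stored verbatim. How OmniAct detects boundaries and what it keeps in each summary is not specified in the abstract.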
Source: arXiv: 2606.27251