arXiv submission date: 2026-02-26
📄 Abstract - OmniGAIA: Towards Native Omni-Modal AI Agents

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent built under a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and refined with OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
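
The "tool-integrated reasoning" paradigm mentioned in the abstract amounts, at a high level, to a multi-turn loop in which the agent interleaves reasoning over omni-modal evidence with external tool calls until it can answer. The Python sketch below illustrates one plausible structure for such a loop; the names (`OmniAgent`, `Step`, `policy`, the tool registry) are illustrative assumptions and are not taken from the OmniAtlas implementation.

```python
# Hypothetical sketch of a multi-turn tool-integrated reasoning loop over
# omni-modal inputs. All names are illustrative, not the paper's code.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    thought: str        # intermediate reasoning emitted by the model
    tool: str           # tool the agent chose (e.g. video QA, ASR, OCR)
    tool_input: dict    # arguments passed to that tool
    observation: str    # result returned by the tool


@dataclass
class OmniAgent:
    # policy: the underlying omni-modal LLM; given the query and the
    # trajectory so far, it returns either {"final_answer": ...} or
    # {"thought": ..., "tool": ..., "tool_input": {...}}.
    policy: Callable[[str, List[Step]], dict]
    tools: Dict[str, Callable[..., str]]
    max_turns: int = 8
    trajectory: List[Step] = field(default_factory=list)

    def run(self, query: str) -> str:
        for _ in range(self.max_turns):
            action = self.policy(query, self.trajectory)
            if "final_answer" in action:
                return action["final_answer"]
            # Execute the chosen tool and feed the observation back in.
            observation = self.tools[action["tool"]](**action["tool_input"])
            self.trajectory.append(
                Step(action["thought"], action["tool"],
                     action["tool_input"], observation)
            )
        return "No answer produced within the turn budget."
```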

Top tags: agents, multi-modal, benchmark
Detailed tags: omni-modal agents, tool usage, cross-modal reasoning, foundation agent, evaluation benchmark

OmniGAIA: Towards Native Omni-Modal AI Agents


1️⃣ One-Sentence Summary

This paper introduces OmniGAIA, a benchmark for evaluating omni-modal AI agents, and develops OmniAtlas, a native omni-modal foundation agent, with the goal of enabling AI to jointly process visual, auditory, and linguistic information the way humans do, perform complex reasoning, and invoke tools to better solve complex real-world tasks.

Source: arXiv 2602.22897