arXiv submission date: 2025-12-29
📄 Abstract - OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack fine-grained cross-modal understanding and struggle with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10%-20% in accuracy.
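To make the coarse-to-fine idea concrete, below is a minimal Python sketch of how an audio-guided perception loop might be wired: coarse audio-event localization first, then visual tools applied only to the audio-localized spans. Every name here (`detect_audio_events`, `caption_video_segment`, `answer_question`) is a hypothetical placeholder; the paper's actual tool set, planner, and interfaces are not described in this summary.

```python
# Hypothetical sketch of a coarse-to-fine audio-guided perception loop.
# All tools below are stubs standing in for real models; the abstract does
# not specify OmniAgent's actual tools or planner interface.
from dataclasses import dataclass


@dataclass
class AudioEvent:
    label: str      # e.g. "speech" or "glass breaking"
    start_s: float  # event start time in seconds
    end_s: float    # event end time in seconds


def detect_audio_events(audio_path: str) -> list[AudioEvent]:
    """Coarse stage (assumed tool): tag salient audio events with timestamps."""
    # Placeholder: a real system would call an audio tagging/localization model.
    return [AudioEvent("speech", 3.0, 8.5), AudioEvent("glass breaking", 12.1, 12.9)]


def caption_video_segment(video_path: str, start_s: float, end_s: float) -> str:
    """Fine stage (assumed tool): describe only the audio-localized segment."""
    # Placeholder: a real system would run a video captioner on the clipped span.
    return f"frames from {start_s:.1f}s to {end_s:.1f}s described here"


def answer_question(question: str, video_path: str, audio_path: str) -> str:
    """Audio-guided active perception: localize first, then look closely.

    Rather than densely captioning every frame, the agent uses audio cues
    to decide *where* to spend visual attention.
    """
    events = detect_audio_events(audio_path)
    # Toy relevance filter; fall back to all events if nothing matches.
    relevant = [e for e in events if e.label.split()[0] in question.lower()] or events
    observations = [
        f"[{e.label}] " + caption_video_segment(video_path, e.start_s, e.end_s)
        for e in relevant
    ]
    # A real agent would hand these observations to an LLM for final reasoning.
    return " | ".join(observations)


print(answer_question("What breaks in the video?", "clip.mp4", "clip.wav"))
```

The point of the sketch is the control flow, not the stubs: the perceptual budget is spent only on segments the audio flags as eventful, which is what distinguishes this active paradigm from dense frame captioning.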

Top-level tags: multi-modal agents, model evaluation
Detailed tags: audio-visual understanding, active perception, tool orchestration, benchmark, multimodal alignment

OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding


1️⃣ One-Sentence Summary

This paper proposes an agent called OmniAgent that actively uses audio cues to dynamically invoke tools, enabling more fine-grained understanding and analysis of audio-video content and achieving state-of-the-art performance on multiple benchmarks.

Source: arXiv:2512.23646