arXiv submission date: 2026-05-05
📄 Abstract - Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

Multimodal visual object tracking can be divided into several kinds of tasks (e.g., RGB and RGB+X tracking), based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multi-modal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multi-modal information into a unified space, allowing flexible modality fusion and robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multi-modal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training run, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks, while maintaining high inference efficiency. Notably, even after model compression, OneTrackerV2 retains strong performance. Moreover, OneTrackerV2 demonstrates remarkable robustness under modality-missing scenarios.
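
For readers unfamiliar with the mixture-of-experts idea that DMoE builds on, the following is a minimal sketch of generic top-k gated expert routing for a single token vector. It is not the paper's T-MoE or M-MoE: the abstract does not specify the router, the expert design, or the value of k, so the class `ToyMoE`, its linear experts, the softmax router, and all parameter values are hypothetical choices made only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyMoE:
    """Generic top-k gated mixture-of-experts over one token vector.

    Hypothetical illustration: each expert is a single linear map and the
    router is a linear layer followed by softmax. This is not DMoE itself.
    """

    def __init__(self, dim, num_experts, top_k, seed=0):
        if not 1 <= top_k <= num_experts:
            raise ValueError("top_k must be between 1 and num_experts")
        rng = random.Random(seed)
        self.top_k = top_k
        # One router row per expert; one dim x dim matrix per expert.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
        self.experts = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]

    def route(self, token):
        """Return (expert_index, weight) pairs for the top-k experts."""
        probs = softmax(matvec(self.router, token))
        chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        # Renormalize so the selected experts' weights sum to 1.
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, token):
        """Weighted sum of the selected experts' outputs."""
        out = [0.0] * len(token)
        for index, weight in self.route(token):
            expert_out = matvec(self.experts[index], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


moe = ToyMoE(dim=4, num_experts=3, top_k=2)
token = [0.5, -1.0, 0.25, 2.0]
routing = moe.route(token)   # which experts fire, and with what weight
output = moe.forward(token)  # mixed expert output, same length as token
```

Only the selected experts are evaluated for each token, which is the usual reason sparse expert layers can add capacity without a proportional increase in inference cost.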

Top-level tags: computer vision multi-modal model training
Detailed tags: visual tracking mixture-of-experts unified framework modality fusion robustness

Unified Multimodal Visual Tracking with Dual Mixture-of-Experts


1️⃣ One-sentence summary

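This paper proposes OneTrackerV2, a unified multimodal visual tracking framework that uses a novel Dual Mixture-of-Experts (DMoE) structure and a Meta Merger to enable end-to-end training on RGB and a variety of other input modalities, achieving the best performance on 12 benchmarks while maintaining stable tracking even when some modalities are missing.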

Source: arXiv: 2605.03716