arXiv submission date: 2026-05-19
📄 Abstract - Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.
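The in-LLM step described above (allocating a retention budget from temporal windows down to modalities using query relevance scores) can be illustrated with a minimal sketch. This is not the SEATS implementation: the function names `allocate_budget` and `select_tokens`, the proportional largest-remainder split, and the use of summed per-token scores as the window and modality relevance are all assumptions made for illustration. The abstract does not specify how relevance scores are computed or aggregated.

```python
from typing import Dict, Hashable, List


def allocate_budget(scores: Dict[Hashable, float], budget: int) -> Dict[Hashable, int]:
    """Split an integer budget across keys in proportion to non-negative scores.

    Uses the largest-remainder method (an assumption, not taken from the paper).
    """
    total = sum(scores.values())
    if total > 0:
        shares = {k: budget * s / total for k, s in scores.items()}
    else:
        # No relevance signal at all: fall back to a uniform split.
        shares = {k: budget / len(scores) for k in scores}
    alloc = {k: int(shares[k]) for k in scores}
    leftover = max(0, budget - sum(alloc.values()))
    # Hand the remaining units to the keys with the largest fractional parts.
    by_remainder = sorted(scores, key=lambda k: shares[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def select_tokens(
    windows: List[Dict[str, List[float]]], keep_ratio: float
) -> List[Dict[str, List[int]]]:
    """Pick token indices to keep, window by window and modality by modality.

    `windows[i][modality]` holds hypothetical query-relevance scores, one per token.
    """
    n_tokens = sum(len(s) for w in windows for s in w.values())
    budget = int(n_tokens * keep_ratio)

    # Level 1: split the global budget across temporal windows.
    window_scores = {i: sum(sum(s) for s in w.values()) for i, w in enumerate(windows)}
    window_budget = allocate_budget(window_scores, budget)

    kept = []
    for i, w in enumerate(windows):
        # Level 2: split each window's budget across its modalities.
        modality_scores = {m: sum(s) for m, s in w.items()}
        modality_budget = allocate_budget(modality_scores, window_budget[i])
        kept_in_window = {}
        for m, s in w.items():
            # Cap at the number of tokens available; unused budget is simply dropped.
            k = min(modality_budget[m], len(s))
            top = sorted(range(len(s)), key=lambda j: s[j], reverse=True)[:k]
            kept_in_window[m] = sorted(top)  # restore temporal order
        kept.append(kept_in_window)
    return kept


example_windows = [
    {"video": [4.0, 1.0, 3.0, 0.0], "audio": [2.0, 0.0]},
    {"video": [0.0, 1.0], "audio": [1.0, 0.0]},
]
example_kept = select_tokens(example_windows, keep_ratio=0.5)
```

In this toy input the first window carries most of the relevance, so it receives most of the budget, and within it the video tokens receive more slots than the audio tokens. A real system would obtain the scores from the model (for example from attention between query and non-textual tokens), which this sketch leaves out.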

Top-level tags: systems, multi-modal, llm
Detailed tags: model efficiency, token pruning, inference acceleration, cross-modal attention, training-free method

Stage-adaptive Token Selection for Efficient Omni-modal LLMs


1️⃣ One-sentence summary

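This paper proposes SEATS, a training-free token selection method that analyzes how the importance of visual and audio tokens changes across the layers of omni-modal LLMs and adaptively prunes redundant tokens at different stages of the model (before the LLM, in intermediate layers, and in late layers), achieving a nearly 5x speedup at very low computational cost (retaining only 10% of non-textual tokens) while preserving over 96% of the model's performance.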
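To make the three stages concrete, the sketch below computes how many non-textual tokens each layer would see under a simple staged schedule. Every parameter here (`pre_keep`, `block_size`, `block_keep`, `drop_from`) is a hypothetical knob chosen for illustration. The source does not give the actual block boundaries, per-block ratios, or the layer at which the remaining tokens are dropped.

```python
from typing import List


def nontext_tokens_per_layer(
    n_nontext: int,
    num_layers: int,
    pre_keep: float,
    block_size: int,
    block_keep: float,
    drop_from: int,
) -> List[int]:
    """Return the number of non-textual tokens each layer sees.

    Stage 1 prunes once before the LLM, stage 2 prunes at every block
    boundary, and stage 3 drops all non-textual tokens from `drop_from` on.
    """
    # Stage 1: one-off pruning before the first layer.
    current = int(n_nontext * pre_keep)
    counts = []
    for layer in range(num_layers):
        if layer >= drop_from:
            # Stage 3: late layers keep no non-textual tokens.
            current = 0
        elif layer > 0 and layer % block_size == 0:
            # Stage 2: prune again at the start of each new block.
            current = int(current * block_keep)
        counts.append(current)
    return counts


schedule = nontext_tokens_per_layer(
    n_nontext=1000, num_layers=8, pre_keep=0.5,
    block_size=2, block_keep=0.5, drop_from=6,
)
print(schedule)  # [500, 500, 250, 250, 125, 125, 0, 0]
```

With these made-up settings, the number of non-textual tokens shrinks block by block and reaches zero in the last two layers, which is the shape of schedule the summary describes. The numbers are not meant to reproduce the reported FLOPs or speedup figures.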

Source: arXiv: 2605.20035