arXiv submission date: 2026-04-15
📄 Abstract - Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.
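The core idea of the audio-side time prompt (encoding timestamps as embeddings and interleaving them within the audio feature sequence as temporal coordinates) can be sketched as follows. This is a minimal illustration only: the sinusoidal embedding, the prompt interval, and the function/parameter names are assumptions, not the paper's exact design.

```python
import numpy as np

def interleave_time_prompts(audio_feats, frame_hop_s, prompt_every_s, d_model):
    """Interleave timestamp-embedding 'temporal coordinates' into a frame-level
    audio feature sequence (sketch of the audio-side time prompt idea).
    audio_feats: array of shape (T, d_model), one row per audio frame."""

    def time_embedding(t):
        # Sinusoidal encoding of timestamp t in seconds. The paper's actual
        # embedding scheme is not specified here; this is an assumption.
        freqs = 1.0 / (10000 ** (np.arange(d_model // 2) * 2.0 / d_model))
        return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

    step = int(round(prompt_every_s / frame_hop_s))  # frames between prompts
    out = []
    for i in range(0, len(audio_feats), step):
        out.append(time_embedding(i * frame_hop_s)[None, :])  # time coordinate
        out.append(audio_feats[i:i + step])                   # audio chunk
    return np.concatenate(out, axis=0)

# 100 frames at a 10 ms hop (1 s of audio), a time prompt every 0.25 s:
feats = np.random.randn(100, 64)
seq = interleave_time_prompts(feats, frame_hop_s=0.01, prompt_every_s=0.25,
                              d_model=64)
print(seq.shape)  # 4 prompt rows interleaved with 100 frames -> (104, 64)
```

The interleaved sequence then replaces the plain audio features as input to the model, so every chunk of audio frames is preceded by an explicit marker of where it sits on the timeline.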

Top-level tags: audio, multi-modal, model training
Detailed tags: temporal perception, audio-language models, reinforcement learning, audio grounding, sound event detection

Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt


1️⃣ One-sentence summary

This work proposes TimePro-RL, a method that embeds timestamp prompts into the audio feature sequence and combines them with reinforcement learning, substantially improving large audio-language models on fine-grained temporal tasks such as identifying when sound events begin and end.

Source: arXiv:2604.13715