arXiv submission date: 2026-02-05
📄 Abstract - Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding, surpasses CosyVoice3 and TangoFlux in generation quality, and is capable of synthesizing arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works to achieve unified understanding and generation for general audio. Model, data, and code are available at the Bagpiper Home Page.
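To make the caption-then-process workflow concrete, here is a minimal Python sketch of the two-stage inference loop the abstract describes. All names below (`BagpiperModel`, `audio_to_caption`, `caption_to_audio`, `reason`, `solve_audio_task`) are hypothetical illustrations of the idea, not the released API.

```python
# Hypothetical sketch of the caption-then-process workflow.
# Every class and method name here is an illustrative assumption,
# not Bagpiper's actual interface.

from dataclasses import dataclass


@dataclass
class RichCaption:
    """A comprehensive natural-language description of an audio signal,
    covering cognitive concepts such as transcription and audio events."""
    text: str


class BagpiperModel:
    def audio_to_caption(self, audio: bytes) -> RichCaption:
        """Understanding direction: map raw audio into the
        high-level conceptual space (a rich caption)."""
        ...

    def caption_to_audio(self, caption: RichCaption) -> bytes:
        """Generation direction: synthesize audio (speech, music,
        sound effects, or their composition) from a rich caption."""
        ...

    def reason(self, caption: RichCaption, task: str) -> str:
        """Intermediate cognitive step: solve an open-ended task
        over the caption rather than over the raw signal."""
        ...


def solve_audio_task(model: BagpiperModel, audio: bytes, task: str) -> str:
    # Caption-then-process: first describe the signal, then reason
    # over that description to answer the task.
    caption = model.audio_to_caption(audio)
    return model.reason(caption, task)
```

The point of this structure is that downstream reasoning operates on the caption, not the waveform, so the same model can be pointed at new tasks without task-specific heads or priors.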

Top-level tags: audio multi-modal model training
Detailed tags: audio foundation model audio captioning unified understanding generation speech synthesis audio generation

Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions


1️⃣ One-Sentence Summary

This paper introduces Bagpiper, a general-purpose audio foundation model that establishes a bidirectional mapping between raw audio signals and comprehensive natural-language descriptions ("rich captions"), allowing it to handle diverse, complex audio understanding and generation tasks in a unified way without task-specific training.

Source: arXiv: 2602.05220