arXiv submission date: 2026-02-03
📄 Abstract - Rethinking Music Captioning with Music Metadata LLMs

Music captioning, the task of generating a natural language description of music, is useful for both music understanding and controllable music generation. Training captioning models, however, typically requires high-quality music caption data, which is scarce compared to metadata (e.g., genre, mood, etc.). As a result, it is common to use large language models (LLMs) to synthesize captions from metadata as training data for captioning models, though this process imposes a fixed stylization and entangles factual information with natural language style. As a more direct approach, we propose metadata-based captioning: we train a metadata prediction model to infer detailed music metadata from audio and then convert it into expressive captions via pre-trained LLMs at inference time. Compared to a strong end-to-end baseline trained on LLM-generated captions derived from metadata, our method (1) achieves comparable performance to end-to-end captioners with less training time, (2) offers the flexibility to change stylization post-training, so output captions can be tailored to specific stylistic and quality requirements, and (3) can be prompted with audio and partial metadata to enable powerful metadata imputation or in-filling, a common task when organizing music data.
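To make the two-stage pipeline concrete, here is a minimal Python sketch under stated assumptions: the function names (`predict_metadata`, `caption_prompt`) and the metadata fields are hypothetical illustrations, not the paper's actual API. The key point it shows is that stylization lives entirely in the inference-time prompt, so it can be changed without retraining.

```python
def predict_metadata(audio_path: str) -> dict:
    """Stage 1 (stand-in): an audio model infers detailed metadata.

    A real system would run a trained metadata-prediction model on the
    audio file; here we return fixed values so the sketch is runnable.
    """
    return {
        "genre": "lo-fi hip hop",
        "mood": "relaxed",
        "tempo_bpm": 82,
        "instruments": ["electric piano", "drum machine", "vinyl crackle"],
    }


def caption_prompt(metadata: dict, style: str = "concise and vivid") -> str:
    """Stage 2: turn predicted metadata into a prompt for a pre-trained LLM.

    Because stylization happens only at inference time, changing `style`
    re-styles the captions without touching the audio model.
    """
    fields = "\n".join(f"- {k}: {v}" for k, v in metadata.items())
    return (
        f"Write a {style} one-sentence description of a music track "
        f"with the following attributes:\n{fields}"
    )


if __name__ == "__main__":
    meta = predict_metadata("track.wav")
    print(caption_prompt(meta))  # default stylization
    print(caption_prompt(meta, style="playlist-editorial, two sentences"))
```

Swapping the `style` argument (or the whole prompt template) is what the abstract means by tailoring output captions post-training.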

Top tags: audio, llm, natural language processing
Detailed tags: music captioning, metadata prediction, multimodal generation, audio-language models, data augmentation

Rethinking Music Captioning with Music Metadata LLMs


1️⃣ One-Sentence Summary

This paper proposes a new approach to music captioning: a model first extracts detailed music metadata from audio, and a large language model then converts that metadata into vivid textual descriptions. The approach trains efficiently, allows the caption style to be adjusted flexibly, and supports completing a full set of music tags from partial information.
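The tag-completion point corresponds to the metadata imputation described in the abstract. Below is a hypothetical sketch of the idea: the field names and the merge rule are illustrative assumptions, not the paper's implementation. Fields the user already knows are kept; only the gaps are filled from the audio-conditioned predictions.

```python
def impute_metadata(predicted: dict, known: dict) -> dict:
    """Keep user-supplied metadata, fill the missing fields from predictions.

    `predicted` comes from the audio-conditioned metadata model; `known`
    holds whatever partial metadata the catalog already has (None = missing).
    """
    return {**predicted, **{k: v for k, v in known.items() if v is not None}}


# Example: the genre is known from the catalog; mood and tempo are not.
predicted = {"genre": "ambient", "mood": "calm", "tempo_bpm": 70}
known = {"genre": "drone", "mood": None}
print(impute_metadata(predicted, known))
# {'genre': 'drone', 'mood': 'calm', 'tempo_bpm': 70}
```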

Source: arXiv 2602.03023