arXiv submission date: 2026-05-06
📄 Abstract - VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly necessary. Despite their utility, current automatic transcription systems face significant challenges: they often rely on complex multi-stage pipelines, struggle to recover text-note alignments, and exhibit poor generalization to out-of-distribution (OOD) singing data. To alleviate these issues, we present VocalParse, a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Specifically, our novel contribution is to introduce an interleaved prompting formulation that jointly models lyrics, melody, and word-note correspondence, yielding a generated sequence that directly maps to a structured musical score. Furthermore, we propose a Chain-of-Thought (CoT) style prompting strategy, which decodes lyrics first as a semantic scaffold, significantly mitigating the context disruption problem while preserving the structural benefits of interleaved generation. Experiments demonstrate that VocalParse achieves state-of-the-art SVT performance on multiple singing datasets. The source code and checkpoint are available at this https URL.
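To make the interleaved formulation concrete, below is a minimal illustrative sketch of how a token stream that interleaves lyrics and notes could map directly to a structured score. The token names (`WORD:`, `NOTE:`) and the pitch/duration encoding are assumptions for illustration only, not the paper's actual format.

```python
# Hypothetical sketch of an interleaved sequence as described in the abstract:
# lyrics, melody, and word-note correspondence are emitted in one token stream
# that maps directly to a structured score. Token names (WORD:, NOTE:) and the
# "pitch/duration" encoding are illustrative assumptions, not VocalParse's format.

def parse_interleaved(tokens):
    """Group an interleaved token stream into (word, [(pitch, duration)]) pairs."""
    score = []
    current_word, current_notes = None, []
    for tok in tokens:
        kind, _, value = tok.partition(":")
        if kind == "WORD":
            if current_word is not None:
                score.append((current_word, current_notes))
            current_word, current_notes = value, []
        elif kind == "NOTE":
            pitch, dur = value.split("/")
            current_notes.append((pitch, float(dur)))
    if current_word is not None:
        score.append((current_word, current_notes))
    return score

# Example: "shining" sung over two notes (a melisma), which a flat
# word-level alignment could not represent without the interleaving.
seq = ["WORD:twinkle", "NOTE:C4/0.5", "NOTE:C4/0.5",
       "WORD:shining", "NOTE:G4/0.5", "NOTE:A4/0.5"]
print(parse_interleaved(seq))
```

Because each word token is immediately followed by its notes, the word-note alignment is recovered for free during decoding rather than by a separate alignment stage.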

Top-level tags: audio machine learning natural language processing
Detailed tags: singing voice transcription large audio language model interleaved prompting chain-of-thought lyrics-to-melody alignment

VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models


1️⃣ One-sentence summary

This paper presents VocalParse, a singing voice transcription model built on a large audio language model. Using a novel interleaved prompting formulation combined with a chain-of-thought strategy, it recognizes lyrics, melody, and word-note alignment directly from audio in a single pass and produces a structured musical score, addressing the complexity and poor generalization of traditional multi-stage transcription pipelines.

Source: arXiv: 2605.04613