Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

📄 Abstract - Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Building on such a foundation model lets us inherit its capacity for natural, non-studio audio (background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events) while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach, which we call the "Reference Shortcut": during training under standard noise schedules, the model can identify the matching reference by its acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of conditioning a general-purpose audio model on a free-form scene description, rather than passing structured dialogue scripts through a speech-only pipeline.
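The idea of a high-noise-biased timestep distribution can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not ScenA's actual recipe: the abstract does not give the distribution, so the sketch uses a simple power transform of a uniform draw, takes t = 1 to mean pure noise, and uses a linear interpolation between the clean target and noise. The names `sample_timestep`, `noisy_target`, and `noise_bias` are hypothetical.

```python
import random


def sample_timestep(rng, noise_bias=3.0):
    """Draw a timestep t in [0, 1).

    Assumed convention: t near 1 means the target is almost pure noise.
    noise_bias = 1 gives a uniform draw; larger values push mass toward
    high noise. The power transform is an illustrative choice only.
    """
    if noise_bias <= 0:
        raise ValueError("noise_bias must be positive")
    u = rng.random()
    # For u ~ Uniform(0, 1), E[u ** (1 / k)] = k / (k + 1).
    return u ** (1.0 / noise_bias)


def noisy_target(clean, noise, t):
    """Linear interpolation between the clean target and noise (assumed form)."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


rng = random.Random(0)
n_draws = 20000
uniform_mean = sum(sample_timestep(rng, 1.0) for _ in range(n_draws)) / n_draws
biased_mean = sum(sample_timestep(rng, 3.0) for _ in range(n_draws)) / n_draws
print(f"mean t, uniform schedule: {uniform_mean:.3f}")
print(f"mean t, high-noise bias:  {biased_mean:.3f}")
```

Under these assumptions the biased schedule spends most training steps at timesteps where little of the clean target survives in the noisy input, so matching a reference by acoustic similarity becomes less informative and the text prompt has to carry the speaker assignment.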


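1️⃣ One-Sentence Summary

This paper proposes ScenA, a method that builds on an audio foundation model pretrained on in-the-wild data and, given only reference voices for multiple speakers and a natural-language description of the whole conversational scene, directly generates realistic multi-speaker audio scenes with background noise, reverberation, overlapping dialogue, and emotional vocalizations, while a high-noise-biased training strategy resolves the "Reference Shortcut" problem, in which the model bypasses the text instruction and relies only on acoustic similarity.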
