📄 Abstract - Step-Audio-EditX Technical Report

We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.

Top-level tags: audio llm model training
Detailed tags: audio editing text-to-speech synthetic data emotion control zero-shot

📄 Paper Summary

Step-Audio-EditX Technical Report


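1️⃣ One-Sentence Summary

This paper presents Step-Audio-EditX, the first open-source audio editing tool based on a large language model, which uses a novel synthetic-data training method to achieve highly expressive editing of fine-grained audio attributes such as emotion and speaking style, along with zero-shot speech generation, and which outperforms existing state-of-the-art models on multiple tasks.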

