ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
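1️⃣ One-sentence summary
This paper proposes ControlFoley, a system that generates high-quality, synchronized audio with precise control from video content, text descriptions, or reference audio clips, and that handles the conflicts that can arise between these inputs.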
Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict. Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system. Code, models, datasets, and demos are publicly available.
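The abstract names random modality dropout as part of the modality-robust training scheme but gives no implementation details. The sketch below illustrates only the general idea, not the authors' method: during training, each conditioning input (video, text, reference audio) is independently replaced by a null placeholder with some probability, so the model learns to work with any subset of inputs. The function name `random_modality_dropout`, the `drop_prob` value, the zero-vector placeholder, and the rule that at least one modality is always kept are all assumptions made for illustration.

```python
import random


def random_modality_dropout(features, drop_prob=0.3, rng=None):
    """Return a copy of `features` with some modalities zeroed out.

    `features` maps a modality name to a list of floats (a stand-in for an
    embedding). Each modality is dropped independently with probability
    `drop_prob`. Assumption: at least one modality is always kept, so the
    model never sees a fully empty condition.
    """
    rng = rng or random.Random()
    names = sorted(features)

    # Decide independently for each modality whether it survives.
    kept = [name for name in names if rng.random() >= drop_prob]

    # Hypothetical safeguard: if everything was dropped, keep one at random.
    if not kept:
        kept = [rng.choice(names)]

    # Dropped modalities become zero vectors of the same length.
    return {
        name: list(features[name]) if name in kept else [0.0] * len(features[name])
        for name in names
    }


if __name__ == "__main__":
    feats = {"video": [1.0, 2.0], "text": [3.0], "audio": [4.0, 5.0, 6.0]}
    print(random_modality_dropout(feats, drop_prob=0.5, rng=random.Random(0)))
```

In a real training loop the placeholder would more likely be a learned null embedding than a zero vector, and the per-modality drop rates could differ; both choices are left open here because the abstract does not specify them.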
Source: arXiv: 2604.15086