arXiv submission date: 2026-02-24
📄 Abstract - Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones at test time. To this end, we present MMHNet, a multimodal hierarchical network that extends state-of-the-art video-to-audio models. Our approach combines a hierarchical design with non-causal Mamba to support long-form audio generation, substantially improving the generation of audio longer than 5 minutes. We also demonstrate that training on short clips and testing on long ones is feasible in video-to-audio generation, without ever training on the longer durations. Our experiments show that the proposed method achieves strong results on long-video-to-audio benchmarks, outperforming prior work on video-to-audio tasks. Moreover, we showcase our model's ability to generate more than 5 minutes of audio, whereas prior video-to-audio methods fall short at such durations.
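The abstract's "train short, test long" idea can be illustrated with a minimal sketch: a local level processes fixed-size chunks (the length seen in training), and a global non-causal pass mixes the chunk summaries, so inputs of any length are handled. This is an assumption-laden toy, not the paper's implementation; the names `CHUNK`, `local_encode`, and `global_mix` are hypothetical, and the bidirectional cumulative average merely stands in for the non-causal Mamba block.

```python
import numpy as np

CHUNK = 16  # frame chunk size seen during "training" (illustrative)
D = 8       # feature dimension (illustrative)

def local_encode(chunk):
    """Local level: encode one fixed-size chunk independently.
    Each chunk matches the training length, so any input length works."""
    return chunk.mean(axis=0)  # (D,) chunk summary

def global_mix(summaries):
    """Global level: a non-causal (bidirectional) pass over chunk summaries,
    a toy stand-in for the paper's non-causal Mamba block."""
    counts = np.arange(1, len(summaries) + 1)[:, None]
    fwd = np.cumsum(summaries, axis=0) / counts            # left-to-right context
    bwd = np.cumsum(summaries[::-1], axis=0)[::-1] / counts[::-1]  # right-to-left
    return (fwd + bwd) / 2

def hierarchical_forward(frames):
    n = len(frames) // CHUNK
    chunks = frames[: n * CHUNK].reshape(n, CHUNK, D)
    summaries = np.stack([local_encode(c) for c in chunks])
    return global_mix(summaries)  # (n, D) context to condition audio decoding

# A "long" input, 20x the training chunk length, still processes fine.
long_video = np.random.default_rng(0).normal(size=(20 * CHUNK, D))
context = hierarchical_forward(long_video)
print(context.shape)  # (20, 8)
```

The key property is that no component ever sees a sequence longer than what training exposed it to: the local encoder always receives `CHUNK` frames, and the global pass is length-agnostic.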

Top-level tags: video audio multi-modal
Detailed tags: video-to-audio length generalization long-form generation multimodal alignment mamba

Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models


1️⃣ One-sentence summary

This work proposes a new method called MMHNet which, by combining a hierarchical structure with non-causal Mamba, enables video-to-audio generation models trained only on short videos to produce high-quality audio longer than 5 minutes, solving the problem of generalizing from short training samples to long test samples.

Source: arXiv: 2602.20981