arXiv submission date: 2025-12-10
📄 Abstract - VABench: A Comprehensive Benchmark for Audio-Video Generation

Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
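The abstract lists pairwise similarities (text-video, text-audio, video-audio) among the evaluation dimensions but does not say how they are computed. A common approach in other work is cosine similarity between embeddings produced by a joint multimodal encoder. The sketch below assumes such embeddings are already available and shows only the scoring step; `pairwise_scores` is a hypothetical helper for illustration, not VABench's actual API or method.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot score a zero-length embedding")
    return dot / (norm_a * norm_b)


def pairwise_scores(
    text_emb: Sequence[float],
    video_emb: Sequence[float],
    audio_emb: Sequence[float],
) -> Dict[str, float]:
    """Hypothetical helper: the three pairwise scores named in the abstract.

    Assumes all three embeddings come from one shared embedding space
    (an assumption; the abstract does not specify the encoder).
    """
    return {
        "text-video": cosine_similarity(text_emb, video_emb),
        "text-audio": cosine_similarity(text_emb, audio_emb),
        "video-audio": cosine_similarity(video_emb, audio_emb),
    }


if __name__ == "__main__":
    # Toy vectors: text and video agree, audio points elsewhere.
    print(pairwise_scores([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Scoring each pair separately keeps the three dimensions independent, so a generated clip can match its prompt visually while its audio drifts from both the prompt and the video.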

Top-level tags: video generation, benchmark, multi-modal
Detailed tags: audio-video generation, evaluation framework, synchronization, text-to-audio-video, multi-dimensional assessment

VABench: A Comprehensive Benchmark for Audio-Video Generation


1️⃣ One-Sentence Summary

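This paper proposes VABench, a comprehensive benchmark framework for systematically evaluating AI models that generate synchronized audio and video together, filling a gap in existing video generation benchmarks, which lack key metrics such as audio-video synchronization.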


Source: arXiv 2512.09299