📄 Abstract - Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights

Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, and fundamentally hinders their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To move evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. A comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) shows that specialized T2I models are proficient at aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming their specialized counterparts in causal narrative coherence. However, even these unified architectures still trail closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling and ultimately limiting world knowledge internalization and generation.

Top-level tags: multi-modal, benchmark, model evaluation
Detailed tags: text-to-multi-image generation, causal reasoning, spatiotemporal consistency, world knowledge, dynamic process modeling

Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights


1️⃣ One-Sentence Summary

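This paper proposes a new benchmark, Envision, for evaluating how well AI models understand and generate causally coherent multi-image sequences that unfold over time, and finds that existing models still face core challenges in modeling dynamic world processes and maintaining spatiotemporal consistency.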

