arXiv submission date: 2026-02-23
📄 Abstract - 3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism

AI video generation is evolving rapidly. For video generators to be useful in applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process -- requiring human annotation or bespoke evaluation datasets of restricted scope. Here we develop an automated evaluation framework for video realism that captures both semantics and coherent 3D structure and does not require access to a reference video. Our method, 3DSPA, is a 3D spatiotemporal point autoencoder that integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. 3DSPA models how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos that violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations. The code and pretrained model weights will be available at this https URL.
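To make the "unified representation" concrete, here is a minimal sketch of the idea the abstract describes: concatenating per-point 3D trajectories, depth cues, and semantic descriptors into one token per tracked point, then using an autoencoder's reconstruction error as a reference-free realism proxy. All names, dimensions, and the toy linear autoencoder below are illustrative assumptions -- the actual 3DSPA architecture and training objective are not specified in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): T frames, N tracked points,
# 3D coordinates, a scalar depth cue, and a DINO-style semantic descriptor.
T, N, SEM_DIM, LATENT = 16, 32, 8, 12

def build_point_tokens(trajs, depths, sems):
    """Concatenate per-point trajectory, depth, and semantic features into
    one token per (frame, point) pair -- a stand-in for the paper's
    'unified representation'."""
    # trajs: (T, N, 3), depths: (T, N, 1), sems: (T, N, SEM_DIM)
    return np.concatenate([trajs, depths, sems], axis=-1)  # (T, N, 3+1+SEM_DIM)

class LinearAutoencoder:
    """Toy linear encoder/decoder standing in for the trained 3DSPA model."""
    def __init__(self, in_dim, latent, seed=0):
        r = np.random.default_rng(seed)
        self.enc = r.normal(scale=0.1, size=(in_dim, latent))
        self.dec = r.normal(scale=0.1, size=(latent, in_dim))

    def reconstruction_error(self, tokens):
        # Mean squared reconstruction error over all point tokens;
        # under this proxy, lower error = more "in-distribution" motion.
        flat = tokens.reshape(-1, tokens.shape[-1])
        recon = flat @ self.enc @ self.dec
        return float(np.mean((flat - recon) ** 2))

# Synthetic "video": points drifting smoothly, as a plausible clip might.
t = np.linspace(0.0, 1.0, T)[:, None, None]
trajs = np.tile(rng.normal(size=(1, N, 3)), (T, 1, 1)) + 0.1 * t
depths = np.ones((T, N, 1)) + 0.05 * t
sems = np.tile(rng.normal(size=(1, N, SEM_DIM)), (T, 1, 1))

tokens = build_point_tokens(trajs, depths, sems)
model = LinearAutoencoder(in_dim=tokens.shape[-1], latent=LATENT)
score = model.reconstruction_error(tokens)
```

A trained model would rank implausible clips (teleporting objects, inconsistent depth) with higher reconstruction error than natural ones; the scaffold above only shows how the three feature streams combine into a single scorable representation.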

Top-level tags: video generation, model evaluation, computer vision
Detailed tags: video realism, 3D semantics, autoencoder, temporal consistency, physical plausibility

3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism


1️�⃣ One-sentence summary

This paper proposes an automated framework called 3DSPA that evaluates the realism of AI-generated videos by combining 3D motion trajectories with scene semantics; it requires no reference video, effectively detects content that violates physical laws, and produces assessments that align closely with human judgments.

Source: arXiv 2602.20354