arXiv submission date: 2025-12-18
📄 Abstract - 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

Top-level tags: multi-modal, video, benchmark
Detailed tags: 4d video understanding, multimodal llm, perceptual distillation, region-level prompting, video question answering

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation


1️⃣ One-sentence summary

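This paper proposes 4D-RGPT, a multimodal large language model trained with a perceptual distillation method that notably improves region-level understanding of 3D structure and temporal dynamics in video, and it introduces a dedicated benchmark, R4D-Bench, to evaluate this capability.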

Source: arXiv: 2512.17012