arXiv submission date: 2026-01-15
📄 Abstract - Future Optical Flow Prediction Improves Robot Control & Video Generation

Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.
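To make the core idea concrete, below is a minimal, hypothetical sketch of language-conditioned future optical flow prediction with a conditional diffusion-style denoiser. All module names, dimensions, the noise schedule, and the training step are illustrative assumptions, not the FOFPred architecture from the paper; the VLM embedding is replaced by a random placeholder tensor.

```python
# Illustrative sketch only (assumed design, not the paper's implementation):
# condition a denoiser on a fused vision-language embedding and train it to
# recover the noise added to a 2-channel future optical flow map (dx, dy).
import torch
import torch.nn as nn

class FlowDenoiser(nn.Module):
    """Predicts the noise added to a future flow map, given a conditioning vector."""
    def __init__(self, cond_dim: int = 512, hidden: int = 64):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)   # project the (assumed) VLM embedding
        self.net = nn.Sequential(
            nn.Conv2d(2 + hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, 2, 3, padding=1),        # predicted noise on (dx, dy)
        )

    def forward(self, noisy_flow: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # noisy_flow: (B, 2, H, W); cond: (B, cond_dim)
        b, _, h, w = noisy_flow.shape
        c = self.cond_proj(cond).view(b, -1, 1, 1).expand(b, -1, h, w)
        return self.net(torch.cat([noisy_flow, c], dim=1))

# Toy training step with a standard denoising objective (predict the added noise).
model = FlowDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

cond = torch.randn(4, 512)               # placeholder for a VLM embedding of (image, instruction)
target_flow = torch.randn(4, 2, 64, 64)  # placeholder for ground-truth future flow
noise = torch.randn_like(target_flow)
t = torch.rand(4, 1, 1, 1)               # crude continuous noise level in [0, 1]
noisy = (1 - t) * target_flow + t * noise  # simple linear interpolation schedule

pred_noise = model(noisy, cond)
loss = nn.functional.mse_loss(pred_noise, noise)
loss.backward()
opt.step()
```

In this simplified view, the "unified VLM-Diffusion" combination amounts to letting a multimodal encoder supply the conditioning signal while the diffusion head provides dense, pixel-level flow generation; the paper's actual model details are not reproduced here.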

Top-level tags: computer vision, multi-modal, robotics
Detailed tags: optical flow prediction, vision-language model, diffusion models, robot manipulation, video generation

Future Optical Flow Prediction Improves Robot Control & Video Generation


1️⃣ One-Sentence Summary

This paper introduces FOFPred, a new model that combines a vision-language model with a diffusion model to predict future object motion (optical flow) from language instructions. It is successfully applied to two distinct domains, robot manipulation and video generation, demonstrating the potential of learning generalizable motion prediction from large-scale web video data.

Source: arXiv:2601.10781