arXiv submission date: 2026-04-20
📄 Abstract - EAST: Early Action Prediction Sampling Strategy with Token Masking

Early action prediction seeks to anticipate an action before it fully unfolds, but limited visual evidence makes this task especially challenging. We introduce EAST, a simple and efficient framework that enables a model to reason about incomplete observations. In our empirical study, we identify key components when training early action prediction models. Our key contribution is a randomized training strategy that samples a time step separating observed and unobserved video frames, enabling a single model to generalize seamlessly across all test-time observation ratios. We further show that joint learning on both observed and future (oracle) representations significantly boosts performance, even allowing an encoder-only model to excel. To improve scalability, we propose a token masking procedure that cuts memory usage in half and accelerates training by 2x with negligible accuracy loss. Combined with a forecasting decoder, EAST sets a new state of the art on NTU60, SSv2, and UCF101, surpassing previous best work by 10.1, 7.7, and 3.9 percentage points, respectively.
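
The two training-time ideas in the abstract, sampling a time step that separates observed from unobserved frames and masking tokens to save memory, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_split` and `keep_token_indices`, the uniform sampling of the split, the uniform random choice of tokens, the 4 tokens per frame, and the `keep_ratio` of 0.5 are all assumptions chosen for illustration. The abstract does not specify the sampling distribution or the masking ratio.

```python
import random


def sample_split(num_frames, rng):
    """Pick a time step t so that frames[:t] are observed and frames[t:] are not.

    Assumption: t is drawn uniformly; the abstract only says it is sampled.
    """
    if num_frames < 2:
        raise ValueError("need at least 2 frames to split")
    # randint is inclusive on both ends, so both parts are non-empty.
    return rng.randint(1, num_frames - 1)


def keep_token_indices(num_tokens, keep_ratio, rng):
    """Return sorted indices of the tokens that survive masking.

    Assumption: tokens are dropped uniformly at random.
    """
    num_keep = max(1, int(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), num_keep))


rng = random.Random(0)
frames = list(range(16))      # stand-in for 16 video frames
tokens_per_frame = 4          # hypothetical tokenization granularity

# A fresh split per training sample exposes one model to many observation ratios.
t = sample_split(len(frames), rng)
observed, future = frames[:t], frames[t:]

# Keep half of the observed tokens (hypothetical ratio).
kept = keep_token_indices(len(observed) * tokens_per_frame, keep_ratio=0.5, rng=rng)
```

Drawing a new `t` for every training sample is what lets one set of weights serve every test-time observation ratio, rather than training a separate model per ratio.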

Top-level tags: computer vision, video, model training
Detailed tags: early action prediction, token masking, training strategy, action anticipation, state-of-the-art

EAST: Early Action Prediction Sampling Strategy with Token Masking


1️⃣ One-sentence summary

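This paper proposes EAST, a simple and efficient framework that randomly samples the time step splitting a video into observed and unobserved frames and jointly learns from observed and future representations, so that a single model can predict actions early at any observation ratio, while token masking halves memory usage and training time and the method substantially raises the best reported accuracy on three mainstream benchmarks.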

Source: arXiv: 2604.18367