Zero-Gated Language-conditioned Human Motion Prediction

📄 Abstract - Zero-Gated Language-conditioned Human Motion Prediction

Pose histories provide the core kinematic evidence for 3D human motion prediction, but they lack explicit high-level semantic guidance. This paper introduces ZGL, a lightweight language-conditioned predictor that uses captions of the observed motion as a semantic prior while preserving a strong motion backbone as the main source of dynamics. We render only the observed poses, generate a one-sentence description with a vision-language model, encode the caption with a frozen CLIP-L text tower, and project it into a small set of conditioning tokens. These tokens are injected into a DCT-based spatial-temporal Transformer by compact cross-attention adapters with zero gates: each adapter output is multiplied by a learnable gate initialized to zero, so the full network is numerically identical to the pose-only baseline at initialization and can learn to use language only when it reduces prediction error. On Human3.6M, ZGL improves overall MPJPE over representative motion-prediction baselines in our comparison. Results on CMU Mocap further show that compact caption conditioning transfers to a second benchmark and provides a practical semantic cue for 3D human motion prediction.
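The zero-gate mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head with no learned query/key/value projections, a scalar gate, and plain Python lists in place of tensors. The names `cross_attention` and `zero_gated_adapter` are hypothetical. It only shows why a gate initialized to zero leaves the pose-only features unchanged.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tokens):
    """Scaled dot-product attention from motion features (queries) to
    caption tokens (used as both keys and values).
    Assumption: learned projections are omitted for brevity."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return attended


def zero_gated_adapter(motion_feats, text_tokens, gate=0.0):
    """Residual adapter: output = motion_feats + gate * attention output.
    With gate == 0.0 (the initial value) the output equals the input."""
    attended = cross_attention(motion_feats, text_tokens)
    return [
        [m + gate * a for m, a in zip(m_row, a_row)]
        for m_row, a_row in zip(motion_feats, attended)
    ]


# Toy inputs (made-up numbers): 2 motion features and 3 caption tokens, dim 2.
feats = [[1.0, -2.0], [0.5, 3.0]]
toks = [[0.2, 0.1], [-1.0, 4.0], [2.0, 2.0]]

print(zero_gated_adapter(feats, toks, gate=0.0) == feats)  # True: baseline preserved
print(zero_gated_adapter(feats, toks, gate=0.5) == feats)  # False: language now contributes
```

In a trained model the gate would be a learnable parameter, so optimization can move it away from zero only if using the caption tokens lowers the prediction loss.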


1️⃣ One-sentence summary

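This paper proposes ZGL, a lightweight language-guided method that generates a text description of the observed motion as a semantic cue. Without changing the structure of the underlying motion-prediction model, it significantly improves the accuracy of 3D human motion prediction, and the method transfers easily to other datasets.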
