Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation
1️⃣ One-sentence summary
This paper proposes a two-stage method that uses gesture labels as prior knowledge: gesture-aware pretraining followed by a token-fusion Transformer, which effectively improves the accuracy of 3D hand pose estimation from a single RGB image.
Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: gesture-aware pretraining that learns an informative embedding space using coarse and fine gesture labels from InterHand2.6M, followed by a Transformer that fuses per-joint tokens with gesture embeddings as intermediate representations before the final regression of MANO hand parameters. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.
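The layered objective mentioned above can be sketched as a weighted sum of a MANO parameter term, a per-joint position term, and a structural term. The paper does not specify the exact losses or weights; the function below is a minimal illustrative sketch assuming an L2 parameter loss, an L1 joint loss, and a bone-length consistency constraint, with hypothetical weights `w_param`, `w_joint`, and `w_struct`.

```python
import numpy as np

def layered_loss(pred_params, gt_params, pred_joints, gt_joints,
                 bones, w_param=1.0, w_joint=1.0, w_struct=0.1):
    """Illustrative layered objective (losses and weights are assumptions,
    not the paper's exact formulation).

    pred_params/gt_params: flat MANO pose+shape parameter vectors.
    pred_joints/gt_joints: (J, 3) arrays of 3D joint positions.
    bones: list of (parent, child) joint-index pairs defining the skeleton.
    """
    # Parameter-level term: L2 on MANO parameters.
    l_param = np.mean((pred_params - gt_params) ** 2)

    # Joint-level term: L1 on 3D joint positions.
    l_joint = np.mean(np.abs(pred_joints - gt_joints))

    # Structural term: predicted bone lengths should match ground truth.
    def bone_lengths(joints):
        return np.array([np.linalg.norm(joints[a] - joints[b])
                         for a, b in bones])
    l_struct = np.mean(np.abs(bone_lengths(pred_joints)
                              - bone_lengths(gt_joints)))

    return w_param * l_param + w_joint * l_joint + w_struct * l_struct
```

In practice each term supervises a different level of the regression: the parameter term anchors the MANO space, the joint term drives metric accuracy, and the structural term discourages anatomically implausible poses.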
Source: arXiv:2603.17396