arXiv submission date: 2026-06-10
📄 Abstract - TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization

Text-conditioned 3D generation has progressed rapidly for images and isolated objects, but producing a hand-object mesh remains challenging: the output must preserve language semantics, cross-view consistency, object geometry, articulated hand shape, and physically plausible contact. We present TextHOI-3D, a staged framework that uses generated multi-view observations as an explicit interface between text-conditioned visual generation and geometry-aware hand-object recovery. TextHOI-3D learns a compact VQ token space for fixed-camera hand-object observations, predicts multi-view visual tokens from text with a CLIP-conditioned visual autoregressive model, and recovers a unified hand-object mesh through prior initialization, multi-view joint optimization, and anti-penetration refinement. The design separates semantic generation from geometric recovery while keeping both stages connected by a discrete multi-view representation. On HO3D-derived evaluations, the multi-view setting reduces object CD from 17.26 mm to 4.92 mm and penetration volume from 5.3721 cm^3 to 0.2193 cm^3 compared with a single-view counterpart, while improving hand errors and surface F-scores. These results support multi-view visual tokens as an effective intermediate representation for text-driven 3D hand-object mesh creation.
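The abstract reports object Chamfer distance (CD) and surface F-scores without spelling out their definitions. The sketch below shows one common way such metrics are computed on sampled surface points. It is a minimal illustration under stated assumptions, not the paper's evaluation code: the symmetric averaged form of CD, the threshold `tau`, and the helper names (`nearest_distances`, `chamfer_distance`, `f_score`, `relative_reduction`) are all hypothetical choices made here.

```python
import math


def nearest_distances(src, dst):
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance (assumed form): the mean nearest-neighbor
    distance in each direction, averaged over the two directions."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    return 0.5 * (sum(d_pred) / len(d_pred) + sum(d_gt) / len(d_gt))


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau
    (tau is an assumed parameter; the source does not state its value)."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    precision = sum(d <= tau for d in d_pred) / len(d_pred)
    recall = sum(d <= tau for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_reduction(before, after):
    """Fractional decrease from before to after."""
    return (before - after) / before


# Toy point sets in millimetres (made-up values, for illustration only).
gt_points = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
pred_points = [(0.0, 0.0, 1.0), (10.0, 0.0, 0.0), (0.0, 10.0, 5.0)]

cd_mm = chamfer_distance(pred_points, gt_points)
f_at_2mm = f_score(pred_points, gt_points, tau=2.0)

# Relative reductions implied by the numbers quoted in the abstract.
cd_drop = relative_reduction(17.26, 4.92)       # object CD, mm
pen_drop = relative_reduction(5.3721, 0.2193)   # penetration volume, cm^3
```

Applied to the figures quoted in the abstract, `relative_reduction` gives roughly a 71.5% drop in object CD and a 95.9% drop in penetration volume for the multi-view setting relative to the single-view counterpart. The brute-force nearest-neighbor search is quadratic in the number of points, which is fine for a toy example but would normally be replaced by a spatial index on dense meshes.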

Top-level tags: computer vision, multi-modal, AIGC
Detailed tags: text-to-3D, hand-object interaction, multi-view generation, mesh optimization, discrete representation

TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization


1️⃣ One-sentence summary

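This paper proposes a staged framework that first generates discrete multi-view images of a hand-object interaction from a text prompt, then reconstructs them through joint optimization into a high-quality, low-penetration 3D hand-object mesh, substantially improving the geometric accuracy and physical plausibility of text-driven 3D hand-object interaction generation.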
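To make the "discrete" part concrete, here is a minimal sketch of vector quantization by nearest-codebook lookup, the generic mechanism behind a VQ token space. It is an illustration only and assumes nothing about the paper's actual tokenizer: the 2-D codebook, the feature values, and the function names `quantize` and `dequantize` are hypothetical.

```python
import math


def quantize(vectors, codebook):
    """Map each feature vector to the index (token) of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: math.dist(v, codebook[k]))
        for v in vectors
    ]


def dequantize(tokens, codebook):
    """Recover the codebook vectors that the tokens refer to."""
    return [codebook[t] for t in tokens]


# Hypothetical 4-entry codebook and continuous features (illustrative values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.7, 0.9)]

tokens = quantize(features, codebook)          # integer token ids
reconstructed = dequantize(tokens, codebook)   # quantized approximation of features
```

In a pipeline of the kind summarized here, a text-conditioned model would predict sequences of such integer tokens per view, and a decoder would map them back to images before geometric recovery.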

Source: arXiv: 2606.11805