The Impact of VAE Design on Latent Pose Representations for Diffusion-based Sign Language Production
1️⃣ One-sentence summary
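This paper studies how the architecture and training objective of a variational autoencoder shape the properties of its latent space in sign language production, and how those properties in turn affect the output of the downstream diffusion model. It finds that latent space properties can sometimes explain differences in generation quality better than reconstruction accuracy alone.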
Latent diffusion approaches to sign language production (SLP) rely on an initial stage that learns an encoding of sign pose sequences, enabling generative modeling in the resulting latent space. The autoencoder used in this stage is typically evaluated in terms of reconstruction quality using geometric metrics common in SLP. While informative, these metrics do not fully capture latent space properties that may influence the training and performance of the downstream generative model. In this work, we investigate how architectural and training objective design choices in a variational autoencoder (VAE) for sign pose encoding affect latent space structure, and how these differences translate into the performance of a latent diffusion model for text-to-sign generation. Our experiments on the Phoenix14T dataset show that variations in generative performance, measured through back-translation BLEU scores, can sometimes be better explained by differences in latent space properties than by VAE reconstruction accuracy alone.
Source: arXiv: 2606.22959