📄 Abstract - VA-$\pi$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. Because of this misalignment, generated token sequences may decode into low-quality images, as the generator receives no direct supervision from pixel space. We propose VA-$\pi$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$\pi$ formulates generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize over the discrete token space, VA-$\pi$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward measures how well the predicted token sequences reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The remaining term of the ELBO acts as a natural regularizer, maintaining the distributional consistency of tokens. VA-$\pi$ enables rapid adaptation of existing AR generators, requiring neither tokenizer retraining nor external reward models. With only 1% of the ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains on the GenEval text-to-image benchmark for both a visual generation model (LlamaGen: from 0.306 to 0.339) and a unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at this https URL.
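The core mechanism described above, treating the AR generator as a policy rewarded by teacher-forced pixel reconstruction and regularized toward its original token distribution, can be illustrated with a short sketch. This is a minimal, hypothetical PyTorch illustration under stated assumptions, not the authors' released code: the callables `generator`, `ref_generator` (a frozen copy of the pre-trained model), and `decoder` (the tokenizer's pixel decoder), as well as the weight `beta`, are all stand-ins introduced here for clarity.

```python
# Hypothetical sketch of a VA-pi-style alignment step (not the official code).
import torch
import torch.nn.functional as F

def alignment_step(generator, ref_generator, decoder, images, gt_tokens, beta=0.1):
    """One illustrative step: teacher-forced prediction, pixel-space
    reconstruction reward, REINFORCE-style loss plus a KL regularizer."""
    # Teacher forcing: condition on the ground-truth token prefix.
    logits = generator(gt_tokens)                      # (B, T, V)
    with torch.no_grad():
        ref_logits = ref_generator(gt_tokens)          # frozen reference policy

    # Sample predicted token sequences from the current policy.
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()                            # (B, T)

    # Intrinsic reward: how well the predicted tokens reconstruct the image.
    with torch.no_grad():
        recon = decoder(sampled)                       # tokens -> pixels, (B, C, H, W)
        reward = -F.mse_loss(recon, images, reduction="none").mean(dim=(1, 2, 3))

    # REINFORCE: increase log-prob of sequences with high pixel-space reward.
    logp = dist.log_prob(sampled).sum(dim=1)           # (B,)
    policy_loss = -(reward * logp).mean()

    # KL(policy || reference): keep token distributions close to the pre-trained model.
    kl = F.kl_div(F.log_softmax(ref_logits, dim=-1),
                  F.log_softmax(logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return policy_loss + beta * kl
```

The negative MSE reward here is a placeholder for whatever pixel-space reconstruction quality measure the paper actually uses; the point of the sketch is the structure of the objective (teacher-forced reward plus a distribution-matching regularizer), not its exact terms.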
VA-$\pi$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
1️⃣ One-Sentence Summary
This paper proposes VA-$\pi$, a lightweight post-training framework that treats the autoregressive image generator as a policy and optimizes it directly with pixel-space reconstruction quality as the reward. This resolves the quality degradation caused by the objective mismatch between the image tokenizer and the generator in existing methods, and with only minimal data and a very short tuning time it substantially improves the fidelity and diversity of generated images.