arXiv submission date: 2026-03-30
📄 Abstract - Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models

Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, by which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.
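The abstract mentions injecting source-image features with scale- and layer-specific ratios but does not give the formula. As a rough illustration only, the sketch below assumes the simplest possible form, a linear blend between source and edit features keyed by (scale, layer). The function names, the dictionary layout, and the blend itself are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: blend source-image features into the editing pass.
# The linear blend and every name here are assumptions, not the paper's method.

def inject_features(edit_feat, source_feat, ratio):
    """Return ratio * source + (1 - ratio) * edit, element-wise."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if len(edit_feat) != len(source_feat):
        raise ValueError("feature vectors must have the same length")
    return [ratio * s + (1.0 - ratio) * e for e, s in zip(edit_feat, source_feat)]


def inject_all(edit_feats, source_feats, ratios):
    """Apply inject_features for every (scale, layer) key.

    edit_feats, source_feats: dict mapping (scale, layer) -> list of floats
    ratios: dict mapping (scale, layer) -> injection ratio; a missing key
            means ratio 0, i.e. the edit features pass through unchanged.
    """
    return {
        key: inject_features(edit_feats[key], source_feats[key], ratios.get(key, 0.0))
        for key in edit_feats
    }
```

Under this assumption, a ratio near 1 at a given scale and layer keeps the source structure, while a ratio near 0 leaves the edit untouched. In the abstract's framing these ratios would be learned by the reinforcement learning scheme rather than set by hand.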

Top-level tags: computer vision, model training, multi-modal
Detailed tags: text-guided image editing, visual autoregressive models, structure preservation, reinforcement learning, feature injection

Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models


1️⃣ One-sentence summary

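This paper proposes a new framework based on visual autoregressive models that combines coarse-to-fine token localization, structural feature injection, and an adaptive reinforcement learning strategy to better preserve the source image's structural consistency and background in text-guided image editing, while also improving editing quality.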

Source: arXiv: 2603.28367