📄 Abstract - Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing
Large diffusion transformers (DiTs) follow global editing instructions well but consistently leak local edits into unrelated regions, because joint-attention architectures offer no explicit channel telling the network where to apply the edit. We introduce AdaptEdit, a co-trained, instruction- and region-aware adapter framework that retrofits a frozen DiT into a precise local editor without modifying its backbone weights. A lightweight Block Adapter at every transformer block injects a structured condition stream that factorizes what to edit (instruction semantics) from where to edit (spatial mask); a learned SpatialGate routes the adapter signal selectively into the edit region while keeping the rest of the image near-identical to the source; and a Region-Aware Loss focuses the training objective on the changing pixels. Because these components make the backbone's internal representation mask-aware end-to-end, a thin MaskPredictor head trained jointly with the editor can ground the edit region directly from the instruction and source image, eliminating any user-mask requirement at deployment. We evaluate on two complementary benchmarks: MagicBrush (paired ground-truth targets) to measure pixel-level preservation and edit accuracy, and Emu-Edit Test (no ground-truth images, nine diverse edit categories) to stress-test instruction following and generalization across edit types. On both, AdaptEdit achieves state-of-the-art results, simultaneously outperforming mask-free and oracle-mask baselines. A seven-variant ablation cleanly isolates the contribution of each component.
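The abstract describes the injection mechanism only at a high level, so the following PyTorch sketch shows one plausible reading of it: a bottleneck Block Adapter turns the condition stream into a "what to edit" signal, a SpatialGate derives "where to edit" weights from a soft edit mask, and the gated signal is added residually around a frozen DiT block. All module names, shapes, and the gating formula here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class BlockAdapter(nn.Module):
    """Hypothetical per-block adapter: projects a condition stream
    (instruction + mask features) into the block's hidden width."""
    def __init__(self, cond_dim: int, hidden_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(cond_dim, bottleneck)   # assumed bottleneck design
        self.up = nn.Linear(bottleneck, hidden_dim)
        self.act = nn.GELU()

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (B, N_tokens, cond_dim) -> (B, N_tokens, hidden_dim)
        return self.up(self.act(self.down(cond)))

class SpatialGate(nn.Module):
    """Hypothetical learned gate: combines the block's hidden state with a
    per-token soft mask to decide how much adapter signal each token receives."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim + 1, 1)

    def forward(self, hidden: torch.Tensor, mask_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (B, N, D); mask_tokens: (B, N, 1) with values in [0, 1]
        gate = torch.sigmoid(self.proj(torch.cat([hidden, mask_tokens], dim=-1)))
        return gate * mask_tokens  # hard-suppress leakage outside the mask

def inject(hidden, cond, mask_tokens, adapter: BlockAdapter, gate: SpatialGate):
    """Residual injection around a frozen DiT block's output (sketch)."""
    delta = adapter(cond)           # "what to edit"
    g = gate(hidden, mask_tokens)   # "where to edit"
    return hidden + g * delta       # tokens outside the mask stay near-identical
```

One design note on this sketch: conditioning the gate on both the hidden state and the mask (rather than the mask alone) would let the gate stay soft near region boundaries, which is one way the "routes the adapter signal selectively" behavior described above could be realized.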
Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing
1️⃣ One-Sentence Summary
This paper proposes AdaptEdit, a method that inserts lightweight learnable adapter modules into a frozen large diffusion transformer so that it edits only the local image region specified by the user's instruction, without requiring any region mask from the user at deployment; it keeps unedited regions unchanged while significantly improving local-editing accuracy and instruction following.
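To make the "focus training on the changing pixels while preserving the rest" idea concrete, here is a minimal sketch of what a region-aware objective could look like: the per-pixel reconstruction loss is weighted more heavily inside the edit mask than outside. The weighting scheme and function signature are assumptions for illustration; the paper's exact Region-Aware Loss may differ.

```python
import torch
import torch.nn.functional as F

def region_aware_loss(pred, target, mask, inside_weight=1.0, outside_weight=0.1):
    """Hypothetical region-aware objective: emphasize pixels inside the edit
    mask (the changing region) while still lightly penalizing drift outside it.
    pred, target: (B, C, H, W); mask: (B, 1, H, W) with values in [0, 1]."""
    per_pixel = F.mse_loss(pred, target, reduction="none")          # (B, C, H, W)
    weights = mask * inside_weight + (1.0 - mask) * outside_weight  # broadcast over C
    return (weights * per_pixel).mean()
```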