📄 Abstract - MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
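The abstract's iterative perception-reasoning-action loop can be sketched in a few lines. This is a minimal illustrative simulation, not the paper's actual implementation: the `perceive`, `reason`, and `act` stubs below are hypothetical stand-ins for a multimodal verifier and a diffusion editor (e.g. Flux.1-Kontext), and the "image" is modeled as a simple set of applied edits.

```python
# Hypothetical sketch of a MIRA-style perception-reasoning-action loop.
# All names are illustrative assumptions; in the real system, perceive()
# would be a multimodal model inspecting the edited image, and act()
# would call an image editing model with one atomic instruction.

def perceive(image, instruction):
    """Stub perception: report which requested atomic edits are still missing."""
    missing = [edit for edit in instruction if edit not in image]
    return {"done": not missing, "missing": missing}

def reason(feedback, history):
    """Stub reasoning: pick the next atomic edit from the visual feedback."""
    return feedback["missing"][0]

def act(image, atomic_edit):
    """Stub action: apply one atomic edit (here, just record it)."""
    return image | {atomic_edit}

def run_edit_loop(instruction, image, max_steps=5):
    """Iterate perceive -> reason -> act until the instruction is satisfied."""
    history = []
    for _ in range(max_steps):
        feedback = perceive(image, instruction)   # visual feedback on current state
        if feedback["done"]:
            break
        atomic_edit = reason(feedback, history)   # next atomic edit instruction
        image = act(image, atomic_edit)           # invoke the editing model
        history.append(atomic_edit)
    return image, history

image, history = run_edit_loop(["add a hat", "make the sky blue"], set())
# history holds the atomic edits applied in order, one per loop iteration
```

The point of the loop structure is that each step conditions on fresh feedback from the current image state, rather than committing to a single prompt or a static plan up front.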

Top-level tags: multi-modal agents, model training
Detailed tags: image editing, multimodal reasoning, instruction following, iterative reasoning, tool-use dataset

📄 Paper Summary

MIRA: Multimodal Iterative Reasoning Agent for Image Editing


1️⃣ One-Sentence Summary

This paper proposes MIRA, a lightweight multimodal reasoning agent that simulates a multi-turn human-model interaction process, analyzing and executing image editing instructions step by step, which significantly improves the accuracy and quality of edits under complex instructions.

