arXiv submission date: 2026-02-26
📄 Abstract - Instruction-based Image Editing with Planning, Reasoning, and Generation

Editing images via instructions provides a natural way to generate interactive content, but it is challenging because it demands both strong scene understanding and high-quality generation. Prior work chains large language models, object segmentation models, and editing models for this task; however, each understanding model offers only a single modality, which limits editing quality. We aim to bridge understanding and generation with a new multi-modality model that supplies intelligent abilities to instruction-based image editing models for more complex cases. To this end, we decompose the instruction editing task into multi-modality chain-of-thought steps: Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For CoT planning, a large language model reasons out appropriate sub-prompts given the provided instruction and the abilities of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, we propose a hint-guided instruction-based editing network, built on a large text-to-image diffusion model, that accepts these hints during generation. Extensive experiments demonstrate that our method achieves competitive editing ability on complex real-world images.
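The three-stage pipeline described above (CoT planning, region reasoning, hint-guided editing) can be sketched as follows. This is a minimal illustration only: the function names (`plan_subprompts`, `reason_edit_region`, `edit_with_hints`) and the stub logic are assumptions, with real model calls replaced by placeholders.

```python
# Hedged sketch of the abstract's three-stage editing pipeline.
# All model inference is stubbed out; only the control flow is illustrated.

def plan_subprompts(instruction: str) -> list[str]:
    # Stage 1 (CoT planning): an LLM would decompose the instruction into
    # sub-prompts matched to the editing network's abilities. Stub: split
    # on " and " as a stand-in for real planning.
    return [p.strip() for p in instruction.split(" and ")]

def reason_edit_region(image: dict, sub_prompt: str) -> dict:
    # Stage 2 (region reasoning): a region generation network trained with a
    # multi-modal LLM would predict the editing region (e.g. a mask).
    # Stub: return a dummy bounding box covering the top-left quadrant.
    return {"bbox": (0, 0, image["w"] // 2, image["h"] // 2),
            "prompt": sub_prompt}

def edit_with_hints(image: dict, region: dict) -> dict:
    # Stage 3 (hint-guided editing): a text-to-image diffusion model would
    # regenerate the region using the hint. Stub: record the applied edit.
    image = dict(image)
    image.setdefault("edits", []).append(region)
    return image

def edit_image(image: dict, instruction: str) -> dict:
    # Full pipeline: plan sub-prompts, then reason and edit per sub-prompt.
    for sub_prompt in plan_subprompts(instruction):
        region = reason_edit_region(image, sub_prompt)
        image = edit_with_hints(image, region)
    return image

result = edit_image({"w": 512, "h": 512},
                    "replace the sky with a sunset and add a cat")
print(len(result["edits"]))  # two sub-prompts yield two edits
```

In a real system, each stub would be an inference call (LLM, multi-modal LLM, diffusion model); the point is that planning produces per-step sub-prompts that drive region prediction and then generation.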

Top-level tags: multi-modal, computer vision, model training
Detailed tags: instruction-based image editing, multi-modal chain-of-thought, diffusion models, region reasoning, hint-guided generation

Instruction-based Image Editing with Planning, Reasoning, and Generation


1️⃣ One-sentence summary

This paper proposes a new multi-modal method that, through three chain-of-thought-style steps (planning, region reasoning, and generation), enables AI to more accurately understand complex instructions and edit real-world images, outperforming prior approaches.

Source: arXiv 2602.22624