Abstract - Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations (such as translating, rotating, or resizing objects) due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial-reward-guided model aligns geometric transformations with the linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
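The group-relative idea behind GRPO can be illustrated with a minimal sketch: several rollouts for the same instruction are scored, and each rollout's reward is normalized against the mean and standard deviation of its own group. The abstract does not specify the reward formulas, so `displacement_reward` below is a hypothetical stand-in (projection of the object's center shift onto the instructed direction), and the function names, coordinates, and the use of the population standard deviation are all assumptions made for illustration, not the paper's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def displacement_reward(start, end, target_direction):
    """Hypothetical reward: object-center shift along the instructed direction.

    start, end: (x, y) object centers in normalized image coordinates.
    target_direction: unit vector, e.g. (1.0, 0.0) for "to the right".
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    return dx * target_direction[0] + dy * target_direction[1]


# Four made-up rollouts for an instruction like "move the chair to the right".
start = (0.30, 0.60)
ends = [(0.50, 0.60), (0.30, 0.60), (0.20, 0.60), (0.40, 0.60)]

rewards = [displacement_reward(start, e, (1.0, 0.0)) for e in ends]
advantages = group_relative_advantages(rewards)

for end, adv in zip(ends, advantages):
    print(f"end={end} advantage={adv:+.3f}")
```

In this sketch the rollout that moves the object farthest to the right receives the largest advantage and the one that moves it left receives the smallest, so no paired "before/after" ground truth is needed to rank the samples.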
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
1️⃣ One-Sentence Summary
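This paper proposes Talk2Move, a new method that uses reinforcement learning so that an AI model can precisely move, rotate, or resize objects in an image according to simple text instructions (for example, "move the chair to the right") while keeping the overall scene natural and plausible, with better results than existing techniques.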