See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
1️⃣ One-Sentence Summary
This paper proposes a new method for letting AI assistants click targets more precisely in complex coding interfaces. Instead of guessing a position in a single shot, the agent runs a multi-turn "observe, click, adjust from visual feedback" loop that progressively corrects its errors, significantly improving action success rates in dense environments such as code editors.
Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in coding interfaces, where pixel-level accuracy is required to interact with tightly packed IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of single-step execution, our agent engages in an iterative refinement process, using visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: this https URL.
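The closed-loop grounding idea in the abstract can be sketched as a simple control loop. The sketch below is illustrative only: the `model` callable (a grounding model that sees the current screenshot with the rendered cursor and either proposes a corrected click point or signals it is satisfied) and the `env` object (which captures screenshots and moves the cursor) are hypothetical interfaces, not the paper's actual API.

```python
def ground_with_refinement(model, env, instruction, max_turns=5):
    """Multi-turn GUI grounding with visual feedback (illustrative sketch).

    Hypothetical interfaces:
      model(screenshot, instruction, history) -> (x, y) proposal,
          or None once the visible cursor is judged to be on target.
      env.screenshot() -> current screen image (cursor rendered in it).
      env.move_cursor(x, y) -> moves the cursor, so the *next* screenshot
          gives the model visual feedback on its previous attempt.
    """
    history = []      # previous click attempts, available to the model
    point = None
    for _ in range(max_turns):
        shot = env.screenshot()
        proposal = model(shot, instruction, history)
        if proposal is None:              # model: cursor already on target
            break
        point = proposal
        env.move_cursor(*point)           # act; feedback arrives next turn
        history.append(point)
    return point                          # last (best) click location


# Minimal mock demo: the "model" converges on a fixed target over 3 turns.
class MockEnv:
    def __init__(self):
        self.cursor = None
    def screenshot(self):
        return self.cursor                # stand-in for a rendered image
    def move_cursor(self, x, y):
        self.cursor = (x, y)

TARGET = (100, 50)
_offsets = iter([(8, -6), (2, 1), (0, 0)])   # shrinking displacement errors

def mock_model(shot, instruction, history):
    if shot == TARGET:
        return None                       # visual check: cursor on target
    dx, dy = next(_offsets)
    return (TARGET[0] + dx, TARGET[1] + dy)

result = ground_with_refinement(mock_model, MockEnv(), "click the `def` keyword")
print(result)
```

In this mock run the agent misses by (8, -6) pixels on the first attempt, then corrects to within (2, 1), then lands exactly on the target, after which the model returns `None` and the loop exits early, mirroring the self-correction behavior the abstract describes.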
Source: arXiv: 2604.13019