arXiv submission date: 2026-03-03
📄 Abstract - VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows, capabilities that are challenging for general models, while workflow-based agents lack the specialized knowledge needed for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work makes four key contributions: (i) VisGenData-4k and its construction methodology, which uses a metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) the VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) empirical results showing that our VisionCreator-8B/32B models outperform larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.

Top-level tags: agents, multi-modal, model training
Detailed tags: visual generation, agentic model, end-to-end learning, benchmark, reinforcement learning

VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation


1️⃣ One-sentence summary

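This paper proposes VisionCreator, a new agentic model that unifies understanding, thinking, planning, and creation within a single end-to-end learnable framework. It can autonomously complete complex visual content creation tasks and outperforms larger closed-source models across multiple evaluations.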

Source: arXiv: 2603.02681