arXiv submission date: 2026-01-27
📄 Abstract - Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
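
As a rough illustration of the paradigm shift the abstract describes, the PyTorch-style sketch below contrasts the two objectives: a conventional "vision-as-input" loss that excludes visual positions from supervision, versus a unified "vision-as-target" loss over the whole interleaved stream. It assumes visual tokens have been discretized into the model's prediction vocabulary; the paper's actual tokenization scheme and loss weighting are not given here, and all function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def lm_losses(logits, targets, visual_mask):
    """Contrast 'vision-as-input' vs. 'vision-as-target' supervision.

    logits:      (B, T, V) predictions over a joint vocabulary containing
                 both text tokens and discrete visual tokens.
    targets:     (B, T) token ids of the interleaved vision+text stream.
    visual_mask: (B, T) bool, True at visual-token positions.
    """
    # Standard causal shift: position t predicts token t+1.
    flat_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    flat_targets = targets[:, 1:].reshape(-1)
    flat_visual = visual_mask[:, 1:].reshape(-1)
    per_token = F.cross_entropy(flat_logits, flat_targets, reduction="none")

    # Prevailing VLMs: visual tokens act only as conditioning, so their
    # positions contribute nothing to the training loss.
    vision_as_input_loss = per_token[~flat_visual].mean()

    # VLUAS-style objective: visual tokens are prediction targets too,
    # giving unified autoregressive supervision over both modalities.
    vision_as_target_loss = per_token.mean()
    return vision_as_input_loss, vision_as_target_loss

if __name__ == "__main__":
    B, T, V = 2, 8, 32
    logits = torch.randn(B, T, V)
    targets = torch.randint(0, V, (B, T))
    visual_mask = torch.zeros(B, T, dtype=torch.bool)
    visual_mask[:, :4] = True  # pretend the first 4 tokens are visual
    print(lm_losses(logits, targets, visual_mask))
```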

Top tags: multi-modal, model training, computer vision
Detailed tags: vision-language models, autoregressive supervision, multimodal comprehension, visual token prediction, generalist visual agents

Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision


1️⃣ One-Sentence Summary

This paper proposes a new framework called Youtu-VL, which treats visual information as a learning target for the model (rather than merely auxiliary input), enabling the model to understand image content at a finer granularity and achieve strong results across a range of vision and vision-language tasks.

Source: arXiv:2601.19798