Towards Pixel-Level VLM Perception via Simple Points Prediction
1️⃣ One-Sentence Summary
This paper proposes SimpleSeg, a simple yet effective method that endows multimodal large language models with pixel-level image segmentation by having them directly predict sequences of coordinate points delineating object boundaries, matching or even surpassing traditional methods without any complex, task-specific design.
We present SimpleSeg, a strikingly simple yet highly effective approach to endow Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a simple sequence generation problem: the model directly predicts sequences of points (textual coordinates) delineating object boundaries, entirely within its language space. To achieve high fidelity, we introduce a two-stage SFT$\to$RL training pipeline, where Reinforcement Learning with an IoU-based reward refines the point sequences to accurately match ground-truth contours. We find that the standard MLLM architecture possesses a strong, inherent capacity for low-level perception that can be unlocked without any specialized architecture. On segmentation benchmarks, SimpleSeg achieves performance that is comparable to, and often surpasses, methods relying on complex, task-specific designs. This work demonstrates that precise spatial understanding can emerge from simple point prediction, challenging the prevailing need for auxiliary components and paving the way for more unified and capable VLMs. Homepage: this https URL
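To make the IoU-based reward concrete, below is a minimal Python sketch of how a textual point sequence emitted by the model might be parsed into a polygon, rasterized into a binary mask, and scored against a ground-truth mask. The coordinate format, function names, and reward details are assumptions for illustration, not the paper's actual implementation.

```python
import re
import numpy as np
from PIL import Image, ImageDraw

def parse_points(text):
    """Parse textual coordinates like "(x1,y1),(x2,y2),..." into a list of (x, y) ints.
    The exact output format used by SimpleSeg is an assumption here."""
    pairs = re.findall(r"\(?\s*(\d+)\s*,\s*(\d+)\s*\)?", text)
    return [(int(x), int(y)) for x, y in pairs]

def rasterize_polygon(points, height, width):
    """Rasterize a closed polygon of (x, y) points into a boolean mask."""
    mask = Image.new("L", (width, height), 0)
    if len(points) >= 3:
        ImageDraw.Draw(mask).polygon(points, outline=1, fill=1)
    return np.array(mask, dtype=bool)

def iou_reward(pred_text, gt_mask):
    """IoU between the mask rasterized from predicted points and the ground truth,
    used as a scalar RL reward (a sketch of the idea, not the paper's exact reward)."""
    h, w = gt_mask.shape
    pred_mask = rasterize_polygon(parse_points(pred_text), h, w)
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

# Example: reward for a hypothetical model output on a 100x100 image
gt = np.zeros((100, 100), dtype=bool)
gt[20:80, 30:70] = True  # ground-truth rectangle
print(iou_reward("(30,20),(69,20),(69,79),(30,79)", gt))  # close to 1.0
```

In an RL fine-tuning loop, a score like this would be computed per generated point sequence and used as the reward signal that pushes predicted contours toward the ground-truth masks.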
Source: arXiv: 2601.19228