arXiv submission date: 2026-01-22
📄 Abstract - SAMTok: Representing Any Mask with Two Words

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask from these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss designs. SAMTok builds on SAM2 and is trained on 209M diverse masks, using a mask encoder and a residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted samples for mask understanding and generation, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on the GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
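The abstract names the tokenizer's two components (a mask encoder and a residual vector quantizer) without detailing them. The sketch below shows how a two-stage residual VQ yields exactly two discrete ids per mask, which is the mechanism the "two words" framing points at; the class name, dimensions, and codebook size are assumptions, and the straight-through gradients and codebook updates needed for actual training are omitted.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Two-stage residual vector quantization (illustrative sketch).

    Stage 1 quantizes the pooled mask embedding against a codebook;
    stage 2 quantizes the leftover residual. Each mask therefore maps
    to exactly two discrete ids.
    """
    def __init__(self, dim: int = 256, codebook_size: int = 8192, stages: int = 2):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(stages)
        )

    @torch.no_grad()
    def encode(self, z: torch.Tensor) -> torch.Tensor:
        """z: (batch, dim) mask-encoder features -> (batch, stages) token ids."""
        residual, ids = z, []
        for cb in self.codebooks:
            idx = torch.cdist(residual, cb.weight).argmin(dim=-1)  # nearest entry
            residual = residual - cb(idx)                          # pass residual on
            ids.append(idx)
        return torch.stack(ids, dim=-1)

    def decode(self, ids: torch.Tensor) -> torch.Tensor:
        """(batch, stages) ids -> (batch, dim) embedding for a mask decoder."""
        return sum(cb(ids[:, s]) for s, cb in enumerate(self.codebooks))

rvq = ResidualVQ()
z = torch.randn(4, 256)   # stand-in for pooled SAM2-style mask features
ids = rvq.encode(z)       # ids.shape == (4, 2): two "words" per mask
z_hat = rvq.decode(ids)   # embedding a mask decoder would reconstruct from
```

Because the second stage only has to encode what the first stage missed, two small codebooks can cover a far larger effective space than one codebook of the same total size, which is the usual motivation for residual over flat VQ.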

Top-level tags: multi-modal · computer vision · model training
Detailed tags: mask tokenization · region understanding · segmentation · vision-language models · reinforcement learning

SAMTok: Representing Any Mask with Two Words


1️⃣ One-Sentence Summary

This paper proposes SAMTok, a method that compresses complex image segmentation regions (masks) into a representation of just two special "words". This lets a general-purpose multi-modal large language model understand and generate precise image regions through ordinary language learning, without complex architectural changes, substantially improving its ability to handle pixel-level tasks.
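To make the "two words" framing concrete: once each mask maps to two quantizer ids, those ids can be rendered as special vocabulary tokens inside an ordinary text answer, so the MLLM emits and consumes masks with plain next-token prediction. The serialization below is hypothetical; the token names and delimiters are invented for illustration and are not the paper's actual format.

```python
MASK_OPEN, MASK_CLOSE = "<mask>", "</mask>"  # hypothetical delimiters

def mask_to_words(ids) -> str:
    """Render the two quantizer ids as special vocabulary tokens so a
    standard LLM tokenizer can treat a mask like ordinary text."""
    body = "".join(f"<m{stage}_{int(i)}>" for stage, i in enumerate(ids))
    return f"{MASK_OPEN}{body}{MASK_CLOSE}"

# A grounded answer then stays a plain token sequence, e.g.:
#   "The cat <mask><m0_4117><m1_902></mask> is on the sofa."
print(mask_to_words([4117, 902]))
```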

Source: arXiv:2601.16093