
arXiv submission date: 2025-12-17
📄 Abstract - In Pursuit of Pixel Supervision for Visual Pre-training

At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images using a self-curation strategy that requires minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
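The core mechanism behind an MAE, as referenced in the abstract, is to hide a large fraction of image patches and train the model to reconstruct the missing pixels. The following is a minimal sketch of that random patch-masking step only; the patch count, mask ratio, and function name are illustrative assumptions, not details from the Pixio paper.

```python
import numpy as np

def random_mask_patches(num_patches: int, mask_ratio: float,
                        rng: np.random.Generator):
    """MAE-style random masking: split patch indices into a small
    visible set (fed to the encoder) and a large masked set (the
    reconstruction targets). Returns (visible_idx, masked_idx)."""
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)          # random patch order
    masked_idx = np.sort(perm[:num_masked])      # patches to reconstruct
    visible_idx = np.sort(perm[num_masked:])     # patches the encoder sees
    return visible_idx, masked_idx

# Illustrative numbers: a 224x224 image with 16x16 patches gives
# 14*14 = 196 patches; masking 75% leaves 49 visible patches.
rng = np.random.default_rng(0)
visible, masked = random_mask_patches(196, 0.75, rng)
print(len(visible), len(masked))  # 49 147
```

Because the encoder only processes the visible quarter of the patches, this design also makes pre-training cheaper than running the full image through the network.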

Top-level tags: computer vision, model training, machine learning
Detailed tags: self-supervised learning, masked autoencoder, visual representation, pixel supervision, pre-training

In Pursuit of Pixel Supervision for Visual Pre-training


1️⃣ One-sentence summary

This paper presents Pixio, an enhanced masked autoencoder trained on billions of web-crawled images with more challenging pre-training tasks and a more capable architecture. It demonstrates that pixel-based self-supervised learning remains efficient and competitive, matching or surpassing current state-of-the-art models across a wide range of downstream vision tasks.


Source: arXiv:2512.15715