arXiv submission date: 2026-02-18
📄 Abstract - AFFMAE: Scalable and Efficient Vision Pretraining for Desktop Graphics Cards

Self-supervised pretraining has transformed computer vision by enabling data-efficient fine-tuning, yet high-resolution training typically requires server-scale infrastructure, limiting in-domain foundation model development for many research laboratories. Masked Autoencoders (MAE) reduce computation by encoding only visible tokens, but combining MAE with hierarchical downsampling architectures remains structurally challenging due to dense-grid priors and mask-aware design compromises. We introduce AFFMAE, a masking-friendly hierarchical pretraining framework built on adaptive, off-grid token merging. By discarding masked tokens and performing dynamic merging exclusively over visible tokens, AFFMAE removes dense-grid assumptions while preserving hierarchical scalability. We develop numerically stable mixed-precision Flash-style cluster-attention kernels and mitigate sparse-stage representation collapse via deep supervision. On high-resolution electron microscopy segmentation, AFFMAE matches ViT-MAE performance at equal parameter count while reducing FLOPs by up to 7x, halving memory usage, and training faster on a single RTX 5090. Code available at this https URL.
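The core mechanism, discarding masked tokens first and then merging only the visible ones, can be illustrated with a small sketch. The PyTorch code below is a minimal sketch, not the released implementation: it uses a ToMe-style bipartite similarity merge as a stand-in for the paper's adaptive, off-grid criterion, and the function names (`random_visible_tokens`, `merge_visible_tokens`) and the merge rule are assumptions for illustration.

```python
import torch

def random_visible_tokens(x, mask_ratio=0.75):
    """MAE-style masking: keep a random subset of patch tokens.
    x: (B, N, D) patch embeddings. Returns visible tokens (B, N_vis, D)
    plus the kept indices, so positions are off-grid from here on."""
    B, N, D = x.shape
    n_vis = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)          # random score per token
    keep = noise.argsort(dim=1)[:, :n_vis]             # lowest-noise tokens survive
    x_vis = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return x_vis, keep

def merge_visible_tokens(x_vis, r):
    """Illustrative stand-in for adaptive merging over visible tokens:
    bipartite soft matching that averages the r most similar pairs.
    The paper's adaptive, off-grid criterion may differ."""
    B, N, D = x_vis.shape
    a, b = x_vis[:, ::2], x_vis[:, 1::2]               # split into two token sets
    sim = torch.nn.functional.normalize(a, dim=-1) @ \
          torch.nn.functional.normalize(b, dim=-1).transpose(1, 2)
    best_sim, best_dst = sim.max(dim=-1)               # each a-token's closest b-token
    order = best_sim.argsort(dim=-1, descending=True)  # most similar pairs first
    src = order[:, :r]                                 # a-tokens to merge away
    dst = torch.gather(best_dst, 1, src)               # their b-side partners
    b = b.scatter_reduce(1, dst.unsqueeze(-1).expand(-1, -1, D),
                         torch.gather(a, 1, src.unsqueeze(-1).expand(-1, -1, D)),
                         reduce="mean")                # fold merged tokens into b
    a_keep = torch.gather(a, 1, order[:, r:].unsqueeze(-1).expand(-1, -1, D))
    return torch.cat([a_keep, b], dim=1)               # (B, N - r, D)

x = torch.randn(2, 196, 64)                            # 14x14 patches, toy dim
x_vis, keep = random_visible_tokens(x)                 # 49 visible tokens
merged = merge_visible_tokens(x_vis, r=16)             # 33 tokens enter next stage
print(x_vis.shape, merged.shape)
```

Because merging operates on an already-sparse, off-grid token set, each hierarchical stage shrinks the sequence without ever reinstating a dense 2D grid, which is where the claimed FLOPs and memory savings come from.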

Top-level tags: computer vision, model training, machine learning
Detailed tags: masked autoencoders, self-supervised learning, vision transformers, efficient training, hierarchical architectures

AFFMAE: Scalable and Efficient Vision Pretraining for Desktop Graphics Cards


1️⃣ One-sentence summary

This paper proposes AFFMAE, a new self-supervised vision pretraining method that dynamically merges visible image patches. It preserves high performance while substantially reducing compute and memory requirements, making it feasible to train high-resolution vision models efficiently on a single desktop-class graphics card.
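The abstract also credits deep supervision with preventing representation collapse in the sparse later stages. Below is a minimal sketch of that idea, assuming per-stage reconstruction targets can be tracked through the merges; the head design, `AuxReconHead` name, and loss weight are hypothetical, not taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class AuxReconHead(nn.Module):
    """Hypothetical auxiliary head: maps an intermediate stage's tokens
    back to pixel-patch targets so sparse stages get a direct signal."""
    def __init__(self, dim, patch_pixels):
        super().__init__()
        self.proj = nn.Linear(dim, patch_pixels)

    def forward(self, tokens, targets):
        # tokens: (B, N_stage, dim); targets: matching pixel patches
        return F.mse_loss(self.proj(tokens), targets)

def pretraining_loss(main_loss, stage_tokens, stage_targets, heads, w=0.3):
    # Add a down-weighted per-stage reconstruction loss to the main
    # MAE loss, so every hierarchical stage receives gradient signal.
    aux = sum(h(t, y) for h, t, y in zip(heads, stage_tokens, stage_targets))
    return main_loss + w * aux
```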

Source: arXiv:2602.16249