arXiv submission date: 2026-04-29
📄 Abstract - Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

Sparse Autoencoders (SAEs) have demonstrated significant success in interpreting Large Language Models (LLMs) by decomposing dense representations into sparse, semantic components. However, their potential for analyzing Vision Transformers (ViTs) remains largely under-explored. In this work, we present the first application of SAEs to the ViT [CLS] token for out-of-distribution (OOD) detection, addressing the limitation of existing methods that rely on entangled feature representations. We propose a novel framework utilizing a Top-k SAE to disentangle the dense [CLS] features into a structured latent space. Through this analysis, we reveal that in-distribution (ID) data exhibits consistent, class-specific activation patterns, which we formalize as Class Activation Profiles (CAPs). Our study uncovers a key structural invariant: while ID samples preserve a stable pattern within CAPs, OOD samples systematically disrupt this structure. Leveraging this insight, we introduce a scoring function based on the divergence of core energy profiles to quantify the deviation from ideal activation profiles. Our method achieves strong results on the FPR95 metric, critical for safety-sensitive applications across multiple benchmarks, while also achieving competitive AUROC. Overall, our findings demonstrate that the sparse, disentangled features revealed by SAEs can serve as a powerful, interpretable tool for robust OOD detection in vision models.
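The pipeline described in the abstract can be sketched in a few lines: a Top-k SAE encoder sparsifies the dense [CLS] feature, per-class activation energies are averaged into Class Activation Profiles (CAPs), and a test sample is scored by its divergence from the nearest CAP. This is an illustrative sketch only; the function names, shapes, and the KL-style divergence below are assumptions, not the paper's exact formulation.

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, k):
    """Sparse code for a dense [CLS] feature: keep only the k largest
    pre-activations and apply ReLU (illustrative Top-k SAE encoder)."""
    pre = W_enc @ x + b_enc
    z = np.zeros_like(pre)
    idx = np.argsort(pre)[-k:]            # indices of the k largest pre-activations
    z[idx] = np.maximum(pre[idx], 0.0)    # ReLU on the surviving units
    return z

def class_activation_profile(Z):
    """CAP: mean latent activation energy over a class's ID samples,
    normalized into a distribution over SAE units."""
    energy = Z.mean(axis=0)
    return energy / (energy.sum() + 1e-8)

def ood_score(z, caps):
    """Deviation of a sample's energy profile from its best-matching CAP,
    here measured with a KL-style divergence (an assumed choice)."""
    p = z / (z.sum() + 1e-8)
    eps = 1e-8
    return min(np.sum(p * np.log((p + eps) / (cap + eps))) for cap in caps)
```

A high score means the sample's activation energy is spread across units in a way no ID class exhibits, which is the structural disruption the abstract attributes to OOD inputs.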

Top-level tags: computer vision, machine learning
Detailed tags: ood detection, sparse autoencoders, vision transformers, feature disentanglement, interpretability

Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection


1️⃣ One-sentence summary

This paper presents the first application of sparse autoencoders to the [CLS] features of Vision Transformers. By disentangling the dense representation into a structured latent space, it finds that in-distribution data exhibits class-stable activation patterns, and on that basis proposes a scoring method based on deviations in the energy profile, substantially improving image out-of-distribution detection performance.

Source: arXiv 2604.26409