arXiv submission date: 2026-04-30
📄 Abstract - Do Sparse Autoencoders Capture Concept Manifolds?

Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so, and how? We develop a theoretical framework that answers these questions and show that SAEs can capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing it across features that each selectively tile a restricted region of the underlying geometry. Empirically, we find that SAEs suboptimally recover continuous structures, mixing the global subspace and local tiling solutions in a fragmented regime we call dilution. This explains why manifold structure is rarely visible at the level of individual concepts and motivates post-hoc unsupervised discovery methods that search for coherent groups of atoms rather than isolated directions. More broadly, our results suggest that future representation learning methods should treat geometric objects, not just individual directions, as the basic units of interpretability.
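The abstract's setup can be made concrete with a toy sketch: a minimal ReLU sparse autoencoder (NumPy only, randomly initialized and untrained; all dimensions, variable names, and the circle "concept manifold" are illustrative assumptions, not the paper's actual experiments) that encodes points from a 1-d manifold as sparse combinations of decoder atoms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 8-d activations, 32 atoms.
d_model, n_atoms = 8, 32

# Decoder dictionary: unit-norm atom directions (rows).
W_dec = rng.normal(size=(n_atoms, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
W_enc = W_dec.T.copy()        # a common (tied-style) initialization
b_enc = np.zeros(n_atoms)

def encode(x):
    """Sparse feature activations: ReLU(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(z):
    """Reconstruct activations as a sparse sum of decoder atoms."""
    return z @ W_dec

# A toy 1-d concept manifold: a circle embedded in the first two
# coordinates of the activation space.
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
X = np.zeros((100, d_model))
X[:, 0], X[:, 1] = np.cos(theta), np.sin(theta)

Z = encode(X)          # which atoms fire where along the manifold
X_hat = decode(Z)      # reconstruction from the sparse code
sparsity = (Z > 0).mean()  # fraction of atoms active per point
```

Inspecting which rows of `Z` are active as `theta` sweeps the circle is exactly the kind of question the paper raises: a "global" solution would use one compact group of atoms spanning the whole circle, while a "local" solution would have different atoms fire on different arcs.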

Top-level tags: machine learning, model evaluation
Detailed tags: sparse autoencoders, concept manifolds, interpretability, representation learning, dilution

Do Sparse Autoencoders Capture Concept Manifolds?


1️⃣ One-Sentence Summary

This paper finds that although sparse autoencoders (SAEs) are widely used to extract independent linear features from neural network representations, concepts are often organized instead as low-dimensional manifolds (continuous geometric structures). SAEs can capture these manifolds either globally (a compact group of atoms spanning the whole manifold) or locally (atoms tiling restricted regions of it), but in practice their atom allocation is fragmented across both modes, a regime the authors call dilution, which makes manifold structure hard to identify directly. The authors therefore argue that future interpretability methods should treat geometric objects, not individual directions, as the basic units of analysis.

Source: arXiv 2604.28119