Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

📄 Abstract - Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

Sparse autoencoders (SAEs) have become a central tool for interpreting language models. However, two key SAE analyses remain difficult to scale: (1) matching semantically similar features across layers and (2) compressing large feature circuits into interpretable supernodes. Although these have been treated as separate problems, we show that both are instances of a more fundamental challenge, which we frame as estimating semantic distances between SAE features that lie on different activation manifolds. We introduce a distributional framework for this problem, in which each feature is represented not by a single decoder vector, as is standard in the literature, but by an activation-weighted distribution over the hidden states that express it. By projecting these distributions into a shared reference space and comparing them with the Wasserstein distance, our method provides a unified semantic metric for cross-layer feature comparison. We prove that our representation is invariant to activation rescaling, stable under perturbations, and recovers true matches under finite-sample margin conditions. Empirically, our method outperforms decoder-vector and LLM-based baselines and captures subtle functional distinctions between related features. Notably, it also automatically compresses large feature circuits into interpretable supernodes.
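To make the distributional representation concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes the shared reference space is reached through a hypothetical linear map `P`, and it swaps in a sliced Wasserstein-1 approximation (an average of exact 1-D distances over random directions) in place of whatever solver the paper uses. All function names and parameters here are illustrative.

```python
import numpy as np

def feature_distribution(hidden_states, activations):
    """Represent a feature as (support points, weights): the hidden states
    where it fires, weighted by activation normalized to sum to 1."""
    mask = activations > 0
    weights = activations[mask] / activations[mask].sum()
    return hidden_states[mask], weights

def wasserstein_1d(x, wx, y, wy):
    """Exact Wasserstein-1 distance between two weighted 1-D samples,
    computed as the integral of |CDF_x - CDF_y|."""
    ox, oy = np.argsort(x), np.argsort(y)
    xs, ys = x[ox], y[oy]
    grid = np.sort(np.concatenate([xs, ys]))
    cdf_x = np.concatenate([[0.0], np.cumsum(wx[ox])])
    cdf_y = np.concatenate([[0.0], np.cumsum(wy[oy])])
    ix = np.searchsorted(xs, grid[:-1], side="right")
    iy = np.searchsorted(ys, grid[:-1], side="right")
    return float(np.sum(np.abs(cdf_x[ix] - cdf_y[iy]) * np.diff(grid)))

def sliced_wasserstein(dist_a, dist_b, n_directions=64, seed=0):
    """Approximate multi-dimensional W1 by averaging 1-D W1 distances
    over random unit directions (a stand-in, not the paper's solver)."""
    (xa, wa), (xb, wb) = dist_a, dist_b
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_directions, xa.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return float(np.mean([wasserstein_1d(xa @ d, wa, xb @ d, wb)
                          for d in directions]))

# Synthetic stand-ins for hidden states and one feature's activations.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 16))
acts = np.maximum(rng.normal(size=200), 0.0)
P = rng.normal(size=(16, 8))  # hypothetical map into the shared reference space

base = feature_distribution(hidden @ P, acts)
rescaled = feature_distribution(hidden @ P, 3.0 * acts)       # same feature, rescaled
shifted = feature_distribution((hidden + 2.0) @ P, acts)      # fires on different states

d_rescaled = sliced_wasserstein(base, rescaled)
d_shifted = sliced_wasserstein(base, shifted)
print(d_rescaled, d_shifted)
```

Because the weights are normalized, multiplying all activations by a constant leaves the distribution unchanged, so `d_rescaled` is zero up to floating-point error. That illustrates the rescaling invariance the abstract claims, while `d_shifted` is clearly positive because the feature is expressed by different hidden states.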

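1️⃣ One-Sentence Summary

This paper proposes a unified framework that represents each sparse autoencoder feature as an activation-weighted distribution over hidden states and compares these distributions in a shared space using the Wasserstein distance from optimal transport theory. This addresses both cross-layer feature matching and large-scale feature circuit compression, and produces interpretable supernodes.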
