arXiv submission date: 2025-12-10
📄 Abstract - Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g. annotating dataset differences) or dense embedding models (e.g. for clustering), which lack control over the properties of interest. We propose using sparse autoencoders (SAEs) to create SAE embeddings: representations whose dimensions map to interpretable concepts. Through four data analysis tasks, we show that SAE embeddings are more cost-effective and reliable than LLMs and more controllable than dense embeddings. Using the large hypothesis space of SAEs, we can uncover insights such as (1) semantic differences between datasets and (2) unexpected concept correlations in documents. For instance, by comparing model responses, we find that Grok-4 clarifies ambiguities more often than nine other frontier models. Relative to LLMs, SAE embeddings uncover bigger differences at 2-8x lower cost and identify biases more reliably. Additionally, SAE embeddings are controllable: by filtering concepts, we can (3) cluster documents along axes of interest and (4) outperform dense embeddings on property-based retrieval. Using SAE embeddings, we study model behavior with two case studies: investigating how OpenAI model behavior has changed over time and finding "trigger" phrases learned by Tulu-3 (Lambert et al., 2024) from its training data. These results position SAEs as a versatile tool for unstructured data analysis and highlight the neglected importance of interpreting models through their data.
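To make the idea concrete, below is a minimal sketch (not the paper's implementation) of how an SAE embedding could be computed and used to surface semantic differences between two corpora. The names `ToySAE`, `embed_fn`, and `mean_concept_diff` are hypothetical, the dense embedder is a random placeholder, and the SAE is untrained with random weights; in the paper, trained SAEs are used so that each latent dimension corresponds to an interpretable concept.

```python
# Hedged sketch: pass dense text embeddings through a sparse autoencoder
# encoder to get "SAE embeddings" (sparse concept activations), then rank
# concepts by how much more they fire on corpus A than on corpus B.
# All components here are illustrative assumptions, not the paper's code.

import numpy as np

rng = np.random.default_rng(0)


def embed_fn(texts):
    """Placeholder dense embedder returning random 64-dim vectors.
    In practice this would be a real embedding model or LM activations."""
    return rng.normal(size=(len(texts), 64))


class ToySAE:
    """Tiny ReLU sparse-autoencoder encoder with random (untrained) weights.
    A trained SAE would map each latent to an interpretable concept."""

    def __init__(self, d_in=64, d_latent=256, seed=0):
        r = np.random.default_rng(seed)
        self.W = r.normal(scale=d_in ** -0.5, size=(d_in, d_latent))
        self.b = np.zeros(d_latent)

    def encode(self, x):
        # SAE embedding: sparse, nonnegative concept activations
        return np.maximum(x @ self.W + self.b, 0.0)


def mean_concept_diff(texts_a, texts_b, sae):
    """Rank latent 'concepts' by how much more they activate on corpus A than B."""
    acts_a = sae.encode(embed_fn(texts_a)).mean(axis=0)
    acts_b = sae.encode(embed_fn(texts_b)).mean(axis=0)
    diff = acts_a - acts_b
    return np.argsort(-diff), diff  # concepts most over-represented in A first


if __name__ == "__main__":
    sae = ToySAE()
    corpus_a = ["response that asks a clarifying question", "another A document"]
    corpus_b = ["response that answers directly", "another B document"]
    top_concepts, diff = mean_concept_diff(corpus_a, corpus_b, sae)
    print("Concepts most over-represented in corpus A:", top_concepts[:5])
```

Because the embedding dimensions are (in the trained case) interpretable concepts, the same structure also supports the controllability the abstract describes: filtering or zeroing selected latents before clustering or retrieval restricts the analysis to the properties of interest.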

Top-level tags: natural language processing, model evaluation, machine learning
Detailed tags: sparse autoencoders, interpretable representations, data analysis, embedding analysis, model behavior

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit


1️⃣ One-sentence summary

This paper proposes a new method that uses sparse autoencoders to produce interpretable embeddings. Compared with conventional large language models and dense embeddings, it helps researchers analyze large-scale text data at lower cost and with greater controllability and reliability, uncovering dataset differences, model biases, and hidden concept correlations.


Source: arXiv:2512.10092