📄
Abstract - SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training
In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability *before* fine-tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an interpretable tool for guiding post-training strategies in LLMs. Code is available at this https URL.
SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training
1️⃣ One-sentence summary
This paper proposes a new method called STS, which uses sparse autoencoders to analyze changes in a large language model's internal features. It can accurately predict how well the model will perform in different downstream domains before any fine-tuning is performed, providing an interpretable tool for guiding model optimization.
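To make the idea concrete, the pipeline in the abstract — find the SAE dimensions that shift, then weigh them by how strongly the target domain activates them — can be sketched roughly as below. All names, array shapes, and the exact aggregation rule are hypothetical assumptions, since the abstract does not specify the actual formula.

```python
import numpy as np

def sts_sketch(feat_before, feat_after, domain_feats, top_k=10):
    """Hedged sketch of an STS-style score (names and formula hypothetical).

    feat_before / feat_after: (n_samples, n_sae_dims) SAE feature
        activations characterizing the model before and after the shift.
    domain_feats: (m_samples, n_sae_dims) SAE activations on a
        downstream-domain corpus.
    """
    # 1) Per-dimension shift magnitude in SAE feature space.
    shift = np.abs(feat_after.mean(axis=0) - feat_before.mean(axis=0))
    # 2) Keep only the most-shifted dimensions.
    top = np.argsort(shift)[-top_k:]
    # 3) How strongly does the downstream domain activate those dimensions?
    domain_strength = domain_feats.mean(axis=0)
    # Score: shift-weighted domain activation over the top dimensions.
    return float(np.dot(shift[top], domain_strength[top]) / (shift[top].sum() + 1e-8))
```

Under this sketch, a high score means the dimensions that move the most are also the ones the target domain relies on, so the shift should transfer; a low score means the shift lands on dimensions the domain barely uses.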