Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection
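1️⃣ One-sentence summary
This paper proposes training sparse autoencoders on multilingual data and combines it with a new layer-selection rule, which markedly improves the interpretability of language control and the generation quality of large language models in multilingual settings, offering theoretical guidance and a practical approach to reliable steering in cross-lingual tasks.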
Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on English-only data, and steering layers are chosen heuristically. We address these limitations by advancing a principled, mechanistic account of multilingual language steering with SAEs. First, we show that training SAEs on multilingual data consistently strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. Second, we introduce an a priori steering layer-selection rule based on the intersection of multilingual alignment and language separability, which predicts effective intervention depths without exhaustive layerwise search. We evaluate our approach on LLaMA-3.1-8B and Gemma-2-9B across machine translation and cross-lingual summarization (CrossSumm), using SpBLEU, ROUGE-L, COMET, and LaSE. Our results show that multilingual SAEs combined with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality, providing a principled, predictive, representation-level account of multilingual SAE steering.
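The abstract does not spell out how the intersection of multilingual alignment and language separability is computed, so the following is only a rough sketch of one way such a rule could look. It assumes that each layer already has a scalar alignment score and a scalar separability score (how these are measured is not stated here), rescales both curves to [0, 1], and returns the layer nearest to where the two curves cross. The function names `min_max` and `select_intersection_layer`, the min-max rescaling, and the fallback to the closest-approach layer are all hypothetical choices, not the authors' method.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale a curve to [0, 1]; a flat curve maps to 0.5 everywhere."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_intersection_layer(
    alignment: Sequence[float], separability: Sequence[float]
) -> int:
    """Return the index of the layer nearest to where the two curves cross.

    Hypothetical rule: both inputs hold one score per layer, in layer order.
    """
    if len(alignment) != len(separability) or len(alignment) == 0:
        raise ValueError("need one score per layer for both curves")

    a, s = min_max(alignment), min_max(separability)
    diff = [x - y for x, y in zip(a, s)]

    # Scan for the first adjacent pair of layers where the sign of the
    # difference flips (or hits zero), i.e. where the curves intersect.
    for i in range(len(diff) - 1):
        if diff[i] * diff[i + 1] <= 0:
            return i if abs(diff[i]) <= abs(diff[i + 1]) else i + 1

    # No crossing found: fall back to the layer where the curves are closest.
    return min(range(len(diff)), key=lambda i: abs(diff[i]))


# Made-up scores for a five-layer toy model (not results from the paper).
toy_alignment = [0.9, 0.8, 0.6, 0.4, 0.2]
toy_separability = [0.1, 0.3, 0.5, 0.7, 0.9]
chosen_layer = select_intersection_layer(toy_alignment, toy_separability)
```

With these toy scores the curves cross between layers 2 and 3, and layer 2 is returned because its gap is smaller. A rule of this shape needs only one pass over per-layer statistics, which matches the abstract's point about avoiding an exhaustive layerwise steering search.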
Source: arXiv: 2605.23036