arXiv submission date: 2026-01-13
📄 Abstract - YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

Steering Large Language Models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and personalization. Recent work on Bi-directional Preference Optimization (BiPO) shows that dense steering vectors can be learned directly from preference data in a Direct Preference Optimization (DPO) fashion, enabling control over truthfulness, hallucinations, and safety behaviors. However, dense steering vectors often entangle multiple latent factors due to neuron multi-semanticity, limiting their effectiveness and stability in fine-grained settings such as cultural alignment, where closely related values and behaviors (e.g., among Middle Eastern cultures) must be distinguished. In this paper, we propose Yet another Policy Optimization (YaPO), a *reference-free* method that learns *sparse steering vectors* in the latent space of a Sparse Autoencoder (SAE). By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Empirically, we show that YaPO converges faster, achieves stronger performance, and exhibits improved training stability compared to dense steering baselines. Beyond cultural alignment, YaPO generalizes to a range of alignment-related behaviors, including hallucination, wealth-seeking, jailbreak, and power-seeking. Importantly, YaPO preserves general knowledge, with no measurable degradation on MMLU. Overall, our results show that YaPO provides a general recipe for efficient, stable, and fine-grained alignment of LLMs, with broad applications to controllability and domain adaptation. The associated code and data are publicly available (this https URL).

Top-level tags: llm model training agents
Detailed tags: activation steering sparse autoencoder domain adaptation preference optimization alignment

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation


1️⃣ One-sentence summary

This paper proposes a new method called YaPO, which learns sparse "steering vectors" in the latent space of a Sparse Autoencoder attached to a large language model, enabling fine-grained, stable, and efficient control over model behavior. It applies to settings such as cultural alignment and hallucination reduction, without degrading the model's general knowledge.
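
To make the idea concrete, here is a minimal PyTorch sketch of the mechanism the abstract describes: a trainable sparse code `z` in SAE latent space is decoded into a dense vector, added to one layer's activations via a forward hook, and optimized with a reference-free preference loss while the LLM and SAE stay frozen. This is not the authors' code: the loss form (logistic/Bradley-Terry plus an L1 sparsity penalty), the `sae.decode` / `sae.n_latents` interface, the hook placement, and the `alpha`, `beta`, `lambda_l1` hyperparameters are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, steer, layer, alpha=1.0):
    """Log-probability of input_ids under the model with a steering
    vector added to one layer's output (via a forward hook)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steer  # inject the steering direction
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    handle = layer.register_forward_hook(hook)
    try:
        logits = model(input_ids).logits
    finally:
        handle.remove()
    # Next-token log-probs, scored against the shifted input tokens.
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    token_logp = logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logp.sum(dim=-1)

def yapo_loss(model, sae, z, chosen_ids, rejected_ids, layer,
              beta=1.0, lambda_l1=1e-3):
    """Reference-free preference loss over a sparse SAE code z
    (assumed form: logistic preference term + L1 sparsity penalty)."""
    steer = sae.decode(z)  # sparse SAE code -> dense steering vector
    lp_w = sequence_logprob(model, chosen_ids, steer, layer)
    lp_l = sequence_logprob(model, rejected_ids, steer, layer)
    pref = -F.logsigmoid(beta * (lp_w - lp_l)).mean()
    return pref + lambda_l1 * z.abs().sum()

# Usage sketch: only z is trainable; the LLM and SAE stay frozen.
# layer = model.model.layers[12]            # e.g. a LLaMA-style block
# z = torch.zeros(sae.n_latents, requires_grad=True)
# opt = torch.optim.Adam([z], lr=1e-2)
# loss = yapo_loss(model, sae, z, chosen_ids, rejected_ids, layer)
# loss.backward(); opt.step()
```

Optimizing in the SAE's latent space is what gives the disentanglement the paper claims: each latent coordinate tends to correspond to a more monosemantic feature than a raw residual-stream direction, so a sparse `z` selects a small, interpretable set of features to steer.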

Source: arXiv:2601.08441