Channel Location Constrains the Auditability of Subliminal Learning
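1️⃣ One-sentence summary
This paper finds that, in knowledge distillation, whether a student can be audited before training for covertly inheriting the teacher's hidden trait depends not on model size or identity but on the type of channel that carries the trait: when the trait travels through an initialization-dependent channel, auditing is feasible; when it travels through vocabulary geometry or through computation in the network body, conventional audits fail, and the associated preference still transfers even after the target trait has been removed from the labels.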
Subliminal learning lets a student inherit a teacher's hidden trait from distillation data that never names it. We ask when such transfer can be audited before training. The answer is not model identity or scale alone, but channel location: the carrier through which the trait reaches the student. We find three regimes. In a controlled initialization-dependent body channel, a pre-training screen works. Coverage, the cosine between the student's initial distillation update and the teacher's fine-tuning displacement, predicts held-out transfer (Spearman $\rho \approx 0.95$; AUROC 0.997). In pretrained language models, masked single-token traits instead ride convergent vocabulary geometry. This channel is initialization-independent, so initialization-alignment screens, including coverage, are not mechanistic; the useful handles are post-hoc detection and targeted mitigation. Even when a single-token named entity is removed from the loss, the student's held-out probability for that entity rises to 0.40 on average ($\sim 2500\times$), and a related semantic class transfers. In an untied-head model, orthogonalizing the trait's output row against entangled neighbours collapses leakage, while equal-size random-subspace edits do not. Thus removing a target string from distillation labels does not remove the corresponding preference: neighbouring tokens can carry it. Finally, conditional behaviours can route through the network body. For sycophancy, with agreement and correction markers masked from the loss, transfer reaches about 0.63 of the teacher's effect, localizes to body computation, and evades four audits across two model families. We scope this as masked transfer of a condition-present policy. Channel location is necessary for deciding which audits can be sound. It is not a deployment-ready screen: an audit used outside its carrier regime can give false assurance.
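The two geometric operations named in the abstract can be illustrated with a minimal NumPy sketch. It assumes that coverage is a single cosine over all flattened parameters, and that the output-row edit is a least-squares projection of the trait row off the span of its neighbours' rows. The paper may define either quantity differently, for example per layer or over a parameter subset. The function names `coverage` and `orthogonalize_row`, and the convention that an update or displacement is "parameters after minus parameters before", are hypothetical choices made for illustration, not the authors' code.

```python
import numpy as np


def coverage(student_update, teacher_displacement):
    """Cosine between two parameter-space directions.

    Both arguments are lists of arrays with matching shapes (assumed layout):
    student_update       = student params after one distillation step minus before
    teacher_displacement = teacher fine-tuned params minus teacher base params
    """
    u = np.concatenate([np.ravel(np.asarray(p, dtype=float)) for p in student_update])
    d = np.concatenate([np.ravel(np.asarray(p, dtype=float)) for p in teacher_displacement])
    denom = np.linalg.norm(u) * np.linalg.norm(d)
    if denom == 0.0:
        # A zero vector has no direction; report zero alignment.
        return 0.0
    return float(u @ d / denom)


def orthogonalize_row(W, trait_idx, neighbour_idx):
    """Return a copy of W whose trait row is orthogonal to the neighbour rows.

    W is a (vocab, hidden) output matrix; neighbour_idx lists the rows of the
    tokens treated as entangled with the trait token (assumed to be given).
    """
    W = np.array(W, dtype=float, copy=True)
    basis = W[neighbour_idx].T  # (hidden, k): neighbour rows as columns
    row = W[trait_idx]
    # Least-squares coefficients of the trait row in the neighbour span;
    # lstsq also handles linearly dependent neighbours.
    coeffs, *_ = np.linalg.lstsq(basis, row, rcond=None)
    W[trait_idx] = row - basis @ coeffs
    return W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    delta_teacher = [rng.normal(size=(4, 3)), rng.normal(size=3)]
    delta_student = [p + 0.1 * rng.normal(size=p.shape) for p in delta_teacher]
    print("coverage:", coverage(delta_student, delta_teacher))

    W_out = rng.normal(size=(6, 4))
    W_edit = orthogonalize_row(W_out, trait_idx=0, neighbour_idx=[1, 2])
    print("residual overlap:", W_edit[[1, 2]] @ W_edit[0])
```

In this sketch a coverage near 1 means the student's first distillation step points along the teacher's fine-tuning direction, which is the condition under which the abstract reports that a pre-training screen works. The row edit touches only the trait token's output row, so the random-subspace control mentioned in the abstract would correspond to projecting off an equal number of random directions instead.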
Source: arXiv: 2606.22019