📄 Abstract - Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
While mechanistic interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs influence functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted interventions (removing or augmenting a small fraction of high-influence samples) significantly modulate the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability, providing direct causal evidence for the long-hypothesized functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, offering a principled methodology for steering the developmental trajectories of LLMs.
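The abstract does not spell out the attribution estimator, so the following is only a reference sketch: the standard influence-function formulation (in the Koh and Liang sense) scores a training sample $z$ by how infinitesimally upweighting it would change a quantity of interest. For attributing an interpretable unit, a natural assumed choice is to replace the usual test loss with a differentiable mechanistic metric $m(\theta)$, such as an induction-head score; both $m$ and the use of an (approximate) inverse Hessian are assumptions, not details given above.

```latex
% Influence of training sample z on a mechanistic metric m(theta)
% (the choice of m, e.g. an induction-head score, is an assumption):
\mathcal{I}_{m}(z)
  \;=\;
  -\,\nabla_{\theta} m(\hat{\theta})^{\top}\,
  H_{\hat{\theta}}^{-1}\,
  \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}}
  \;=\;
  \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta})
```

Here $L$ is the training loss, $\hat{\theta}$ the trained parameters, and $H_{\hat{\theta}}$ the empirical Hessian; "high-influence" samples would then be those with the largest $|\mathcal{I}_{m}(z)|$, which is what a targeted removal or augmentation intervention would operate on.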
Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
1️⃣ One-sentence summary
This paper proposes a new method called MDA that, much like genetic ancestry tracing, pinpoints which specific training samples give rise to interpretable functional units inside the model (such as 'induction heads'). Its experiments confirm a causal link between these units and the model's in-context learning ability, and the authors then leverage this finding to build a data augmentation technique that effectively steers the model's development.