Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs?
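1️⃣ One-sentence summary
This study finds that locating and manipulating the specific neurons in large language models associated with the Big Five personality traits can effectively change the model's internal representations of those traits, but cannot reliably control the personality-related text the model ultimately generates, revealing a gap between control over internal representations and control over external behavior.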
Using psychological constructs such as the Big Five, large language models (LLMs) can imitate specific personality profiles and predict a user's personality. While LLMs can exhibit behaviors consistent with these constructs, it remains unclear where and how they are represented inside the model and how they relate to behavioral outputs. To address this gap, we focus on questionnaire-operationalized Big Five concepts, analyze the formation and localization of their internal representations, and use interventions to examine how these representations relate to behavioral outputs. In our experiment, we first use probing to examine where Big Five information emerges across model depth. We then identify neurons that respond selectively to each Big Five concept and test whether enhancing or suppressing their activations can bias latent representations and label generation in intended directions. We find that Big Five information becomes rapidly decodable in early layers and remains detectable through the final layers, while concept-selective neurons are most prevalent in mid layers and exhibit limited overlap across domains. Interventions on these neurons consistently shift probe readouts toward targeted concepts, with targeted success rates exceeding 0.8 for some concepts, indicating that the model's internal separation of Big Five personality traits can be causally steered. At the label-generation level, the same interventions often bias generated label distributions in the intended directions, but the effects are weaker, more concept-dependent, and often accompanied by cross-trait spillover, indicating that comparable control over generated labels is difficult even with interventions on a large fraction of concept-selective neurons. Overall, our findings reveal a gap between representational control and behavioral control in LLMs.
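The select-then-intervene procedure described in the abstract can be illustrated with a small sketch on synthetic data. Everything here is an assumption made for illustration, not the paper's method: the activations are random numbers rather than LLM hidden states, the selectivity score is a standardized mean difference, the intervention is an additive shift, the probe is a nearest-centroid classifier, and all function names are hypothetical.

```python
import numpy as np

def select_neurons(acts, labels, target, top_k):
    """Rank neurons by a standardized mean difference (target vs. other inputs)."""
    pos = acts[labels == target]
    neg = acts[labels != target]
    score = (pos.mean(axis=0) - neg.mean(axis=0)) / (acts.std(axis=0) + 1e-8)
    return np.argsort(score)[::-1][:top_k]

def intervene(acts, neuron_idx, delta):
    """Return a copy with `delta` added to the chosen neurons (assumed additive form).

    A positive delta enhances the neurons; a negative delta suppresses them.
    """
    out = acts.copy()
    out[:, neuron_idx] += delta
    return out

def fit_centroid_probe(acts, labels, n_classes):
    """Nearest-centroid probe: one mean activation vector per concept."""
    return np.stack([acts[labels == c].mean(axis=0) for c in range(n_classes)])

def probe_predict(centroids, acts):
    """Assign each row to the concept whose centroid is closest."""
    dists = ((acts[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def targeted_success_rate(centroids, acts, labels, target):
    """Fraction of non-target inputs that the probe reads out as the target concept."""
    preds = probe_predict(centroids, acts[labels != target])
    return float((preds == target).mean())

# Synthetic stand-in for hidden activations: 5 concepts, 64 neurons,
# with 4 dedicated neurons per concept (an assumption, not real model data).
rng = np.random.default_rng(0)
n_classes, n_neurons, per_class = 5, 64, 200
labels = np.repeat(np.arange(n_classes), per_class)
acts = rng.normal(0.0, 1.0, size=(labels.size, n_neurons))
for c in range(n_classes):
    acts[labels == c, 4 * c : 4 * c + 4] += 2.0

target = 2
centroids = fit_centroid_probe(acts, labels, n_classes)
selected = select_neurons(acts, labels, target, top_k=4)

baseline_rate = targeted_success_rate(centroids, acts, labels, target)
steered = intervene(acts, selected, delta=4.0)
steered_rate = targeted_success_rate(centroids, steered, labels, target)

print("selected neurons:", sorted(selected.tolist()))
print("targeted success rate before:", baseline_rate)
print("targeted success rate after: ", steered_rate)
```

This toy setup only mirrors the representational half of the finding, where shifting a few selective neurons moves the probe readout toward the target concept. It says nothing about label generation, which the abstract reports as weaker and more concept-dependent.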
Source: arXiv: 2604.11802