Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
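1️⃣ One-sentence summary
By comparing how steering along the activation manifold and linear steering each affect a neural network's behavior, this study finds that the former preserves the naturalness and plausibility of the model's outputs, demonstrating a deep causal link between the geometry of the network's internal representations and its final behavior, and offering a new geometric framework for controllable intervention on model internals.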
Neural representations carry rich geometric structure; but does that structure causally shape behavior? To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce. In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally. Concretely, we first fit an activation manifold $M_h$ to representations and a behavior manifold $M_y$ to output probability distributions. We then test the link $M_h \leftrightarrow M_y$ via interventions: we find that steering along $M_h$, which we term manifold steering, yields behavioral trajectories that follow $M_y$, while linear steering -- which assumes a Euclidean geometry -- cuts through off-manifold regions and hence produces unnatural outputs. Moreover, optimizing interventions in activation space to produce paths along $M_y$ recovers activation trajectories that trace the curvature of $M_h$. We demonstrate this bidirectional relationship between the geometry of representation and behavior across tasks and modalities. In language models, we use reasoning tasks with cyclic and sequential geometries as well as in-context learning tasks with more complex graph geometries. In a video world model, we use a task with geometry corresponding to physical dynamics. Overall, our work shows that geometry in neural representation is not merely incidental, but is in fact the proper object for enabling principled control via intervention on internals. This recasts the core problem of steering from finding the right direction to finding the right geometry.
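The contrast between linear steering and manifold steering can be illustrated with a toy example. The sketch below is not the paper's method: it assumes a hypothetical two-dimensional activation space in which the activation manifold $M_h$ is a unit circle (a stand-in for the cyclic geometries mentioned above), and all function names are illustrative. It compares a straight-line path between two on-manifold points with a path that follows the circle, and measures how far each path strays from the manifold.

```python
import math

# Toy assumption: the activation manifold M_h is a circle of given radius in 2D.
# Real activation manifolds would be fitted from data; this is only an illustration.

def point_on_circle(theta, radius=1.0):
    """Return the point on the circle at angle theta (radians)."""
    return (radius * math.cos(theta), radius * math.sin(theta))

def linear_path(a, b, steps):
    """Linear steering: straight-line (Euclidean) interpolation from a to b."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append(tuple((1 - t) * x + t * y for x, y in zip(a, b)))
    return path

def manifold_path(theta_a, theta_b, steps, radius=1.0):
    """Manifold steering: interpolate the angle, so every point stays on the circle."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append(point_on_circle(theta_a + t * (theta_b - theta_a), radius))
    return path

def off_manifold_distance(p, radius=1.0):
    """Distance from point p to the nearest point on the circle."""
    return abs(math.hypot(p[0], p[1]) - radius)

theta_start, theta_end = 0.0, 0.75 * math.pi
start = point_on_circle(theta_start)
end = point_on_circle(theta_end)

linear = linear_path(start, end, steps=11)
curved = manifold_path(theta_start, theta_end, steps=11)

max_linear_dev = max(off_manifold_distance(p) for p in linear)
max_curved_dev = max(off_manifold_distance(p) for p in curved)

print(f"max off-manifold distance, linear path:   {max_linear_dev:.4f}")
print(f"max off-manifold distance, manifold path: {max_curved_dev:.4f}")
```

In this toy setting both paths share the same endpoints, but the straight line passes through the interior of the circle, where no natural activation lies, while the curved path never leaves the manifold. This mirrors, in a simplified way, the abstract's claim that linear steering cuts through off-manifold regions.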
Source: arXiv: 2605.05115