arXiv submission date: 2025-12-30
📄 Abstract - Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
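The pipeline described in the abstract (segment a trace into sentence-level steps, encode step-level activations with an SAE, read decoder columns as reasoning vectors, intervene along one of them) can be sketched as below. This is a toy illustration only, not the authors' implementation: the regex splitter, the `TinySAE` class, the `steer` helper, the dimensions, and the additive `alpha` intervention are all assumptions, the weights are random rather than trained, and the activation is a stand-in rather than something extracted from an LLM.

```python
import math
import random
import re


def split_into_steps(trace: str) -> list[str]:
    """Naively split a chain-of-thought trace into sentence-level steps.

    Assumption: a step ends at '.', '!' or '?' followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", trace.strip())
    return [p for p in parts if p]


class TinySAE:
    """Toy sparse auto-encoder: z = ReLU(W_enc x + b_enc), x_hat = W_dec z + b_dec.

    Hypothetical and untrained; a real SAE would be fit on many step-level
    activations, typically with a reconstruction loss plus a sparsity penalty.
    """

    def __init__(self, d_model: int, d_latent: int, seed: int = 0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(d_model)
        # w_enc[j] is the encoder row for latent j.
        self.w_enc = [[rng.gauss(0, scale) for _ in range(d_model)]
                      for _ in range(d_latent)]
        # w_dec[j] is decoder column j, stored as a d_model-length vector.
        self.w_dec = [[rng.gauss(0, scale) for _ in range(d_model)]
                      for _ in range(d_latent)]
        self.b_enc = [0.0] * d_latent
        self.b_dec = [0.0] * d_model

    def encode(self, x: list[float]) -> list[float]:
        """Map an activation to non-negative latent codes."""
        return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, z: list[float]) -> list[float]:
        """Reconstruct an activation as a weighted sum of decoder columns."""
        out = list(self.b_dec)
        for zj, col in zip(z, self.w_dec):
            for i, c in enumerate(col):
                out[i] += zj * c
        return out

    def reasoning_vector(self, j: int) -> list[float]:
        """Return decoder column j normalized to unit length."""
        col = self.w_dec[j]
        norm = math.sqrt(sum(c * c for c in col))
        return [c / norm for c in col]


def steer(activation: list[float], direction: list[float], alpha: float) -> list[float]:
    """Shift an activation along a direction; alpha > 0 amplifies, alpha < 0 suppresses."""
    return [a + alpha * d for a, d in zip(activation, direction)]


steps = split_into_steps("Let x = 2. Wait, that is wrong. So x = 3.")
sae = TinySAE(d_model=8, d_latent=16)
h = [0.1] * 8                      # stand-in for one step-level activation
z = sae.encode(h)                  # latent codes for that step
h_hat = sae.decode(z)              # reconstruction from the latent codes
v = sae.reasoning_vector(3)        # hypothetical "behavior" direction
h_steered = steer(h, v, alpha=4.0)
```

Because `v` has unit norm, the projection of `h_steered` onto `v` exceeds that of `h` by exactly `alpha`, which is what makes the strength of such an intervention easy to control. How the paper pools activations per step, and at which layer it intervenes, is not stated in the abstract.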

Top-level tags: llm theory model evaluation
Detailed tags: reasoning interpretability unsupervised discovery sparse autoencoders latent behavior activation steering

Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process


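1️⃣ One-sentence summary

This paper proposes RISE, an unsupervised framework that uses sparse auto-encoders to automatically discover and disentangle interpretable reasoning behaviors (such as reflection and backtracking) in the activation space of large language models, and shows that targeted interventions on these behaviors can controllably steer the model's reasoning process without retraining.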

Source: arXiv: 2512.23988