Counterfactual Simulation Training for Chain-of-Thought Faithfulness
1️⃣ One-Sentence Summary
This paper proposes a new method called Counterfactual Simulation Training (CST) that trains large language models to make their chain-of-thought reasoning more faithful and reliable, helping people more accurately understand the true reasons behind a model's decisions.
Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at this https URL.
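To make the reward signal concrete, below is a minimal Python sketch of the core idea the abstract describes: a simulator sees only the CoT (produced on the original input) together with a counterfactual input, and the CoT is rewarded when the simulator correctly predicts the model's actual output on that counterfactual. All names, the 0/1 reward, and the toy hint-following model are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    prompt: str          # original input
    counterfactual: str  # perturbed input (e.g., a cue added or removed)


def cst_reward(
    model: Callable[[str], tuple[str, str]],   # prompt -> (cot, answer)
    simulator: Callable[[str, str], str],      # (cot, counterfactual) -> predicted answer
    ex: Example,
) -> float:
    """Score a CoT by how well it lets a simulator predict the model's
    behavior on a counterfactual input the CoT was not written for."""
    cot, _ = model(ex.prompt)                      # CoT produced on the original input
    _, actual = model(ex.counterfactual)           # model's real answer on the counterfactual
    predicted = simulator(cot, ex.counterfactual)  # simulator sees only the CoT + counterfactual
    return 1.0 if predicted == actual else 0.0     # faithful CoTs make the model simulatable


# Toy demo: a "model" that secretly keys off an embedded hint in the prompt.
def toy_model(prompt: str) -> tuple[str, str]:
    answer = "B" if "hint: B" in prompt else "A"
    cot = "I follow the hint." if "hint" in prompt else "I reason it out from scratch."
    return cot, answer


def toy_simulator(cot: str, counterfactual: str) -> str:
    # Predicts from the CoT alone: a CoT that admits hint-following implies
    # the model will track whatever hint the counterfactual contains.
    if "hint" in cot and "hint: B" in counterfactual:
        return "B"
    return "A"


ex = Example(prompt="What is X? hint: B", counterfactual="What is X?")
print(cst_reward(toy_model, toy_simulator, ex))  # 1.0 — the CoT exposes the cue dependence
```

In the demo, the CoT that openly admits hint-following earns full reward: precisely because it exposes the cue, the simulator can predict how the model behaves once the hint is removed, which is the behavior CST is meant to reinforce.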
Source: arXiv: 2602.20710