为什么RLAIF(从AI反馈中强化学习)会有效? / Why Does RLAIF Work At All?
1️⃣ One-Sentence Summary
This paper proposes a theory to explain why AI models can improve themselves through their own preference judgments: pretraining already encodes human values into the model's internal representations, and a specific guiding prompt (the constitution) "activates" these latent values for use in judgment, enabling effective alignment and improvement of model behavior.
Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement seemingly works for value learning. We propose the latent value hypothesis: pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model in which the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results: RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model's default generation direction, which explains the generation-judgment gap; the ceiling on RLAIF quality is set by how well the representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings, including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.
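The linear model can be sketched numerically. Below is a minimal numpy illustration, not the paper's actual construction: the representation dimension, mixing weight, and subspace rank are all hypothetical. It shows how projecting onto a low-rank value-relevant subspace (the "constitution" as projection operator) can make the resulting judgment direction correlate with the true value direction better than the model's default generation direction does:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # representation dimension (hypothetical)

# True human-value direction assumed to be encoded in representation space.
v = rng.normal(size=d)
v /= np.linalg.norm(v)

# Default generation direction: a weak value component plus off-value noise
# (noise made orthogonal to v so the mixing weight is explicit).
n = rng.normal(size=d)
n -= (n @ v) * v
g = 0.3 * v + n
g /= np.linalg.norm(g)

# Constitution as a projection operator P = U U^T onto a rank-2
# value-relevant subspace containing v plus one distractor direction b.
b = rng.normal(size=d)
b -= (b @ v) * v
b /= np.linalg.norm(b)
U = np.stack([v, b], axis=1)  # d x 2 orthonormal basis
P = U @ U.T

# Constitution-activated judgment direction: project, then renormalize.
c = P @ g
c /= np.linalg.norm(c)

# Generation-judgment gap: the projection discards off-value components,
# so the activated direction aligns with v better than the default one.
print(f"corr(g, v) = {g @ v:.3f}")
print(f"corr(c, v) = {c @ v:.3f}")
```

Because projection can only shrink the off-value components while preserving the component along v, renormalizing after projection strictly increases the correlation with v whenever the default direction has any mass outside the value subspace.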
Source: arXiv:2603.03000