为什么RLAIF(从AI反馈中强化学习)会有效? / Why Does RLAIF Work At All?
1️⃣ One-Sentence Summary
This paper proposes a theory to explain why AI models can improve themselves through their own preference judgments: pretraining already encodes human values into the model's internal representations, and a specific guiding prompt (the constitution) "activates" these latent values for use in judgment, enabling effective alignment and improvement of model behavior.
Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement seemingly works for value learning. We propose the latent value hypothesis: pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model in which the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results: RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model's default generation direction, which explains the generation-judgment gap; the ceiling on RLAIF quality is set by how well the representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings, including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.
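The linear model can be sketched numerically. Below is a minimal numpy illustration, not the paper's actual construction: the representation dimension, mixing weight, and subspace rank are all hypothetical. It shows how projecting onto a low-rank value-relevant subspace (the "constitution" as projection operator) can make the resulting judgment direction correlate with the true value direction better than the model's default generation direction does:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # representation dimension (hypothetical)

# True human-value direction assumed to be encoded in representation space.
v = rng.normal(size=d)
v /= np.linalg.norm(v)

# Default generation direction: a weak value component plus off-value noise
# (noise made orthogonal to v so the mixing weight is explicit).
n = rng.normal(size=d)
n -= (n @ v) * v
g = 0.3 * v + n
g /= np.linalg.norm(g)

# Constitution as a projection operator P = U U^T onto a rank-2
# value-relevant subspace containing v plus one distractor direction b.
b = rng.normal(size=d)
b -= (b @ v) * v
b /= np.linalg.norm(b)
U = np.stack([v, b], axis=1)  # d x 2 orthonormal basis
P = U @ U.T

# Constitution-activated judgment direction: project, then renormalize.
c = P @ g
c /= np.linalg.norm(c)

# Generation-judgment gap: the projection discards off-value components,
# so the activated direction aligns with v better than the default one.
print(f"corr(g, v) = {g @ v:.3f}")
print(f"corr(c, v) = {c @ v:.3f}")
```

Because projection can only shrink the off-value components while preserving the component along v, renormalizing after projection strictly increases the correlation with v whenever the default direction has any mass outside the value subspace.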
Source: arXiv:2603.03000