📄 Abstract - Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma

Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3--10^4 samples from homogeneous annotator pools while 10^7--10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.
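
Read schematically, the three properties in the trilemma can be sketched as follows. This is a minimal reading based only on the abstract; the symbols pi, P, U_g, and A are assumptions introduced for illustration, and the paper's formal definitions may differ.

```latex
% Schematic reading of the trilemma's three properties. All symbols are
% illustrative assumptions: \pi is the aligned policy, P the set of annotator
% subgroups, U_g the utility of subgroup g, \pi^*_g that group's preferred
% policy, and A the set of adversarial perturbations / distribution shifts.
\begin{align*}
\text{(i) } \varepsilon\text{-representativeness:}\quad
  & \max_{g \in P}\, \bigl| U_g(\pi^*_g) - U_g(\pi) \bigr| \le \varepsilon \\
\text{(ii) polynomial tractability:}\quad
  & \text{sample and compute cost} = \mathrm{poly}(d_{\mathrm{context}}) \\
\text{(iii) } \delta\text{-robustness:}\quad
  & \sup_{a \in A}\, \bigl| U(\pi) - U(a \circ \pi) \bigr| \le \delta
\end{align*}
```

Under this reading, the abstract's central claim is that satisfying (i) with epsilon <= 0.01 and (iii) with delta <= 0.001 at global scale already forces Omega(2^{d_context}) operations, which rules out (ii).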

Top tags: theory, machine learning, model training
Detailed tags: ai alignment, rlhf, trilemma, formal analysis, robustness

The Alignment Trilemma: Fundamental Limits of RLHF Systems / Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma


1️⃣ One-Sentence Summary

This paper formalizes the "Alignment Trilemma": no reinforcement learning from human feedback (RLHF) system can simultaneously achieve three desired goals, namely comprehensive representation of diverse human values, computational tractability, and robustness against perturbations.


2️⃣ Key Contributions

1. A formal framework for the Alignment Trilemma

2. A formalization of the three-stage RLHF pipeline

3. The MaxMin-RLHF approach (see the sketch after this list)

4. A modular value architecture
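
To illustrate the max-min idea behind contribution 3, here is a minimal sketch under assumed details: one reward model per annotator subgroup, with the policy trained to raise the expected reward of the worst-off group. The function name maxmin_reward and the group setup are hypothetical; the paper's actual construction may differ.

```python
import numpy as np

def maxmin_reward(group_rewards: np.ndarray) -> float:
    """Aggregate per-group rewards by the expected reward of the worst-off group.

    group_rewards has shape (num_groups, batch_size): one row per annotator
    subgroup, scored by that subgroup's reward model (a hypothetical setup;
    the paper's construction may differ).
    """
    per_group_mean = group_rewards.mean(axis=1)  # expected reward for each group
    return float(per_group_mean.min())           # max-min: the policy is trained to raise this value

# Toy usage: three annotator groups, four sampled responses each.
rng = np.random.default_rng(0)
rewards = rng.normal(loc=[[0.8], [0.2], [0.5]], scale=0.1, size=(3, 4))
print(maxmin_reward(rewards))
```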


3️⃣ Main Results and Value

Result highlights

Achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations is shown to require Omega(2^{d_context}) operations, which is super-polynomial in the context dimensionality. Current RLHF implementations instead collect only 10^3-10^4 preference samples from homogeneous annotator pools, whereas roughly 10^7-10^8 samples would be needed for true global representation.

Practical value

The framework offers a unified explanation for documented RLHF pathologies, including preference collapse, sycophancy, and systematic bias amplification, and it points to concrete directions for navigating the trade-offs through strategic relaxations of the alignment requirements.
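
A back-of-the-envelope sketch of the scale gap: the sample counts and the 2^{d_context} growth are taken from the abstract, while the specific d_context values below are illustrative assumptions.

```python
# Arithmetic for the scale gap stated in the abstract. The sample counts
# (10^3-10^4 collected vs 10^7-10^8 needed) and the Omega(2^{d_context})
# bound come from the abstract; the d_context values are assumptions.
collected_samples = 10**4   # upper end of what current RLHF pipelines collect
needed_samples = 10**7      # lower end of the estimated global-representation requirement
print(f"sample gap: {needed_samples // collected_samples}x")  # -> 1000x

for d_context in (20, 40, 60):          # hypothetical context dimensionalities
    exponential_ops = 2 ** d_context    # the Omega(2^{d_context}) lower bound
    polynomial_ops = d_context ** 3     # a polynomial budget, for contrast
    print(f"d_context={d_context}: 2^d = {exponential_ops:.2e}  vs  d^3 = {polynomial_ops}")
```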


4️⃣ Glossary

RLHF (Reinforcement Learning from Human Feedback): the training paradigm, widely used to align large language models, that optimizes a policy against human preference data.

Alignment Trilemma: the paper's formal claim that no RLHF system can simultaneously achieve representativeness, tractability, and robustness.

Epsilon-representativeness: faithfully reflecting diverse human values across a population, up to an error tolerance epsilon.

Delta-robustness: withstanding adversarial perturbations and distribution shift, up to a tolerance delta.

Preference collapse: an RLHF pathology in which the learned preferences converge onto the dominant preferences in the annotation data, erasing minority views.

Sycophancy: an RLHF pathology in which the model tends to agree with or flatter the user rather than respond accurately.
