Three Models of RLHF Annotation: Extension, Evidence, and Authority
1️⃣ One-sentence summary
This paper distinguishes three roles that annotation data can play in reinforcement learning from human feedback (RLHF): extending the designers' own judgments, providing independent evidence, or carrying the authority of population representatives. It argues that designers should choose the model best suited to each annotation dimension rather than trying to handle every annotation task with a single unified pipeline.
Preference-based alignment methods, most prominently Reinforcement Learning from Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what the system's outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that arise from intentionally or unintentionally conflating them, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor each pipeline to the model most appropriate for that dimension, rather than seeking a single unified pipeline.
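To make the closing recommendation concrete, here is a minimal Python sketch of per-dimension annotation handling. Everything in it, including the dimension names, the guideline check, and the specific aggregation rules, is an illustrative assumption rather than anything the paper specifies; the point is only that each dimension gets the aggregation logic matching its model.

```python
from collections import Counter
from statistics import mean

def is_within_guidelines(label):
    """Hypothetical designer-written check; a real pipeline would encode
    the designers' annotation guidelines here."""
    return label is not None

def aggregate_extension(labels):
    """Extension model: annotators stand in for the designers' own judgment,
    so labels that conflict with the designers' guidelines are filtered out."""
    valid = [label for label in labels if is_within_guidelines(label)]
    return mean(valid) if valid else None

def aggregate_evidence(labels):
    """Evidence model: each label is independent, noisy evidence about a fact,
    so pooling (here, a simple average) is what reduces the noise."""
    return mean(labels)

def aggregate_authority(labels):
    """Authority model: annotators speak for a population, so a majority
    vote respects each voice instead of averaging it away."""
    return Counter(labels).most_common(1)[0][0]

# One aggregation rule per annotation dimension, not one rule for everything.
PIPELINES = {
    "instruction_following": aggregate_extension,  # designers' intent, extended
    "factual_accuracy": aggregate_evidence,        # evidence about the world
    "value_judgments": aggregate_authority,        # the population's say
}
```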
Source: arXiv:2604.25895