arXiv submission date: 2026-04-15
📄 Abstract - Robust Reward Modeling for Large Language Models via Causal Decomposition

Reward models are central to aligning large language models, yet they often overfit to spurious cues such as response length and overly agreeable tone. Most prior work weakens these cues directly by penalizing or controlling specific artifacts, but it does not explicitly encourage the model to ground preferences in the prompt's intent. We learn a decoder that maps a candidate answer to the latent intent embedding of the input. The reconstruction error is used as a signal to regularize the reward model training. We provide theoretical evidence that this signal emphasizes prompt-dependent information while suppressing prompt-independent shortcuts. Across math, helpfulness, and safety benchmarks, the decoder selects shorter and less sycophantic candidates with 0.877 accuracy. Incorporating this signal into RM training in Gemma-2-2B-it and Gemma-2-9B-it increases RewardBench accuracy from 0.832 to 0.868. For Best-of-N selection, our framework increases length-controlled win rates while producing shorter outputs, and remains robust to lengthening and mild off-topic drift in controlled rewrite tests.

Top-level tags: llm, model training, model evaluation
Detailed tags: reward modeling, causal decomposition, alignment, regularization, robustness

Robust Reward Modeling for Large Language Models via Causal Decomposition


1️⃣ One-sentence summary

This paper proposes a new method: a decoder is trained to reconstruct the latent intent of the user's prompt from a candidate answer, and the reconstruction error is used to guide reward-model training. This reduces the reward model's reliance on surface cues such as answer length and a sycophantic tone, focuses it on the user's true intent, and improves judgment accuracy and output quality across multiple tasks.
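The idea in the summary can be sketched numerically. This is a minimal illustration under stated assumptions, not the paper's actual objective: the squared-error reconstruction term, the Bradley-Terry pairwise form, the margin-shift construction, and the weight `lam` are all assumptions made for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_error(decoded, intent_emb):
    # Squared L2 distance between the decoder's output (computed from a
    # candidate answer) and the prompt's latent intent embedding.
    return float(np.sum((np.asarray(decoded) - np.asarray(intent_emb)) ** 2))

def regularized_pairwise_loss(r_chosen, r_rejected,
                              intent_emb, decoded_chosen, decoded_rejected,
                              lam=0.5):
    # Bradley-Terry pairwise loss, with the score margin shifted by the
    # reconstruction gap: the loss falls when the chosen answer both scores
    # higher AND reconstructs the prompt's intent better than the rejected one.
    rec_gap = (reconstruction_error(decoded_rejected, intent_emb)
               - reconstruction_error(decoded_chosen, intent_emb))
    margin = (r_chosen - r_rejected) + lam * rec_gap
    return float(-np.log(sigmoid(margin)))

if __name__ == "__main__":
    intent = [1.0, 0.0]          # latent intent embedding of the prompt
    on_intent = [1.0, 0.0]       # answer whose decoding matches the intent
    off_intent = [0.0, 1.0]      # answer whose decoding drifts off-intent

    # Same reward margin in both cases; only the reconstruction signal differs.
    aligned = regularized_pairwise_loss(2.0, 0.0, intent, on_intent, off_intent)
    misaligned = regularized_pairwise_loss(2.0, 0.0, intent, off_intent, on_intent)
    print(aligned, misaligned)
```

With identical reward scores, the loss is lower when the chosen answer's decoding reconstructs the prompt's intent well, which is the sense in which the reconstruction error regularizes training toward prompt-grounded preferences.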

Source: arXiv:2604.13833