TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

📄 Abstract

This paper investigates abstention learning for large language models (LLMs), building on the ternary reward, which incentivizes truthfulness in LLMs. It extends that idea by moving from a static ternary reward to Trajectory-Informed Advantage Reweighting (TIAR), which dynamically reweights the abstention reward during Group Relative Policy Optimization (GRPO) training. The objective is abstention learning rather than improved truthfulness, and the work serves as an exploration into hallucination reduction. The novelty lies in three areas: the method, the advantage reweighting, and the benchmark selection. TIAR leverages the multiple trajectories that GRPO samples for each query as a natural abstention signal, using the reward signal to explore knowledge boundaries and encourage consistency. The paper demonstrates that these trajectories can serve as an indicator of the policy's confidence on a query, and then uses them to dynamically calculate the abstention advantage. Because the work aims to contribute to the field of abstention learning, AbstentionBench is used as the evaluation benchmark; the method and various baselines were tested on all of its datasets. Empirical results demonstrate that TIAR achieves state-of-the-art abstention F1 scores across five of six evaluation categories, outperforming the static ternary baseline on 17 of 31 benchmark datasets while fully preserving baseline accuracy.


1️⃣ One-sentence summary

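This paper proposes a new method called TIAR that treats the multiple candidate trajectories produced while the model generates answers as a confidence signal and dynamically adjusts the reward weights accordingly, so that large language models are better trained to abstain proactively when uncertain, which effectively reduces hallucinations.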
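
The abstract gives no formulas, so the following is only an illustrative Python sketch of the general idea, not the paper's actual method. It assumes ternary reward values of +1 (correct), 0 (abstain), and -1 (wrong), uses the usual GRPO group normalization for non-abstaining trajectories, and replaces the abstention advantage with a hypothetical value `1 - 2 * confidence`, where `confidence` is the fraction of answered trajectories in the group that are correct. All names (`TERNARY_REWARD`, `group_confidence`, `tiar_like_advantages`) and the reweighting rule itself are assumptions made for illustration.

```python
from statistics import mean, pstdev

CORRECT, ABSTAIN, WRONG = "correct", "abstain", "wrong"

# Assumed ternary reward values; the abstract does not state them.
TERNARY_REWARD = {CORRECT: 1.0, ABSTAIN: 0.0, WRONG: -1.0}


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def group_confidence(outcomes):
    """Hypothetical confidence proxy: share of answered trajectories that are correct.

    Returns 0.5 (uninformative) when every trajectory in the group abstained.
    """
    answered = [o for o in outcomes if o != ABSTAIN]
    if not answered:
        return 0.5
    return sum(o == CORRECT for o in answered) / len(answered)


def tiar_like_advantages(outcomes):
    """Illustrative trajectory-informed abstention advantage (an assumption).

    Non-abstaining trajectories keep their group-relative advantage.
    Abstaining trajectories get 1 - 2 * confidence, so abstention is
    rewarded when the group mostly answers wrongly and penalized when
    the group mostly answers correctly.
    """
    rewards = [TERNARY_REWARD[o] for o in outcomes]
    advantages = group_relative_advantages(rewards)
    abstain_adv = 1.0 - 2.0 * group_confidence(outcomes)
    return [abstain_adv if o == ABSTAIN else a
            for o, a in zip(outcomes, advantages)]


if __name__ == "__main__":
    # One group of sampled trajectories for a query the policy cannot answer.
    print(tiar_like_advantages([WRONG, WRONG, WRONG, ABSTAIN]))
    # One group for a query the policy answers reliably.
    print(tiar_like_advantages([CORRECT, CORRECT, CORRECT, ABSTAIN]))
```

Under these assumptions, the same abstaining trajectory receives a positive advantage in the first group and a negative one in the second, which is the sense in which the group's trajectories act as a per-query confidence indicator. A static ternary reward, by contrast, would score abstention identically in both cases before normalization.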
