Towards Reward Modeling for AI Tutors in Math Mistake Remediation
1️⃣ One-Sentence Summary
This paper proposes a new approach to evaluating and improving the pedagogical quality of AI math tutors: by analyzing human preference data and synthesizing contrastive samples, it trains reward models that accurately judge whether a tutor's response effectively helps students discover and correct their mistakes.
Evaluating the pedagogical quality of AI tutors remains challenging: standard NLG metrics do not capture whether responses identify mistakes, scaffold reasoning, or avoid revealing the answer. For the task of mistake remediation, we derive a hierarchy of pedagogical aspects from human pairwise preferences on MRBench, and synthesize minimally contrastive response pairs that differ along key aspects (e.g., mistake identification and location, targetedness, scaffolding, actionability, clarity, and coherence). We develop and release Bradley-Terry preference models trained on weighted-sum rankings that we automatically create from MRBench, synthetic pairs, and combinations of the two. Using only synthetic data, our best model reaches 0.69 pairwise accuracy on a human preference test set, and combining weighted-sum data with targeted synthetic groups improves accuracy to 0.74, outperforming larger general-purpose reward models while using only a 0.5B-parameter backbone.
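The two core ingredients above can be illustrated concretely. Below is a minimal sketch, not the paper's implementation: the actual models use a 0.5B-parameter backbone scoring full tutor responses, while here hypothetical per-aspect scores and weights stand in for the learned reward. It shows (1) a weighted-sum ranking score over pedagogical aspects and (2) the standard Bradley-Terry pairwise loss, `-log sigmoid(r_chosen - r_rejected)`, used to train preference models.

```python
import math

def weighted_sum_rank(aspect_scores: dict, weights: dict) -> float:
    """Scalar ranking score: weighted sum over pedagogical aspect scores.

    Aspect names and weights here are illustrative, not the paper's values.
    """
    return sum(weights[k] * aspect_scores[k] for k in weights)

def bt_pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise preference loss: -log sigmoid(r_chosen - r_rejected),
    computed in a numerically stable form."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical aspect scores for two tutor responses to the same student error.
weights = {"mistake_identification": 0.4, "scaffolding": 0.4, "clarity": 0.2}
response_a = {"mistake_identification": 0.9, "scaffolding": 0.8, "clarity": 0.7}
response_b = {"mistake_identification": 0.3, "scaffolding": 0.2, "clarity": 0.9}

r_a = weighted_sum_rank(response_a, weights)
r_b = weighted_sum_rank(response_b, weights)
loss = bt_pairwise_loss(r_a, r_b)  # small loss: response A is clearly preferred
```

With equal rewards the loss is log 2 ≈ 0.693; it shrinks toward 0 as the preferred response's margin grows, which is what pushes the model to separate pedagogically better responses from worse ones.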
Source: arXiv:2603.24375