arXiv submission date: 2026-05-28
📄 Abstract - Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
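The abstract describes the reward only at a high level, so the following is a minimal sketch of the general idea rather than the paper's implementation. It assumes a hypothetical toy co-occurrence table (`COOC`), a simple count threshold, a +1/-1 sentence score, and an additive combination with a response-level advantage; none of these details come from the source. The 0.5B extractor is replaced here by hand-supplied entity pairs.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical toy co-occurrence table (assumption): unordered entity pair -> count.
# A real system would look counts up in a Wikipedia-derived index instead.
COOC: Dict[FrozenSet[str], int] = {
    frozenset({"Marie Curie", "Warsaw"}): 120,
    frozenset({"Marie Curie", "Vienna"}): 0,
}


def sentence_score(entities: Tuple[str, str], threshold: int = 1) -> float:
    """Return +1.0 if the entity pair co-occurs at least `threshold` times, else -1.0.

    The threshold rule and the +1/-1 scale are illustrative assumptions.
    """
    count = COOC.get(frozenset(entities), 0)
    return 1.0 if count >= threshold else -1.0


def token_advantages(
    sentence_token_counts: List[int],
    sentence_scores: List[float],
    response_advantage: float,
    weight: float = 0.5,
) -> List[float]:
    """Broadcast each sentence's score to all of its tokens.

    Every token receives the response-level advantage plus a weighted
    sentence-level term (the additive form is an assumption).
    """
    advantages: List[float] = []
    for n_tokens, score in zip(sentence_token_counts, sentence_scores):
        advantages.extend([response_advantage + weight * score] * n_tokens)
    return advantages


# Two sentences: the first has 3 tokens, the second has 2 tokens.
scores = [
    sentence_score(("Marie Curie", "Warsaw")),
    sentence_score(("Marie Curie", "Vienna")),
]
advs = token_advantages([3, 2], scores, response_advantage=0.25, weight=0.5)
```

In this sketch, tokens in the supported sentence end up with a higher advantage than tokens in the unsupported one, which is the kind of within-response distinction a response-level reward alone cannot express.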

Top-level tags: reinforcement learning, natural language processing, model training
Detailed tags: reward design, process supervision, facts verification, question answering, corpus grounding

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering


1️⃣ One-sentence summary

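This paper proposes CorVer, a lightweight reward method that uses Wikipedia co-occurrence statistics to verify a model's reasoning sentence by sentence, substantially improving the accuracy of large language models on factual question answering at very low computational cost.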

Source: arXiv: 2605.29648