When the Majority Votes Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window

📄 Abstract - When the Majority Votes Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window

Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already-solvable problems rather than genuine learning, while problems corrupted from correct to incorrect outnumber truly learned ones, and this damage is irreversible once majority vote locks onto a wrong answer. Per-problem tracking reveals that correct-answer signals in low-ability problems are briefly active before being permanently suppressed, a phenomenon we term the Correct-Answer Extinction Window, with Flip Rate (FR) as its leading indicator. We thus propose TTRL-Guard, a lightweight framework with three mechanisms targeting the extinction window: Flip-Rate-Aware Reward Scaling (FRS) down-weights at-risk updates as FR declines, Minority-Preserving Sampling (MPS) retains gradient signal from minority correct answers, and Risk-Conditioned Sparse Updatings (RCSU) suspends updates on polarized problems. Experiments across three models and four benchmarks show that TTRL-Guard achieves the best average pass@1 on Qwen2.5-7B-Instruct and Qwen3-4B, and improves over TTRL by a relative 54% on AIME 2025. Our code and implementation details are available at this https URL.
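To make the majority-vote pseudo-label signal concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `majority_vote_rewards` and `flip_rate` are hypothetical, the binary reward (1 for agreeing with the majority, 0 otherwise) is an assumed form of the pseudo-label reward, and the abstract does not define Flip Rate, so the definition used here (the fraction of consecutive training steps at which a problem's majority answer changes) is only one plausible reading.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for one problem's sampled answers.

    Assumed reward form: 1.0 if a sample agrees with the majority answer,
    0.0 otherwise. The majority answer is used as the label even when it
    is wrong, which is the failure mode the abstract describes.
    """
    counts = Counter(answers)
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


def flip_rate(majority_history):
    """Fraction of consecutive steps at which the majority answer changed.

    Hypothetical definition: the abstract names Flip Rate as a leading
    indicator but does not give its formula.
    """
    if len(majority_history) < 2:
        return 0.0
    pairs = zip(majority_history, majority_history[1:])
    flips = sum(prev != cur for prev, cur in pairs)
    return flips / (len(majority_history) - 1)


# Suppose the true answer is "42" but most samples say "17":
# the two correct samples receive zero reward.
label, rewards = majority_vote_rewards(["42", "17", "17", "42", "17"])

# Majority answer per training step for one problem: it flips once, then locks.
fr = flip_rate(["42", "42", "17", "17", "17"])
```

In this toy case the correct minority answer is never rewarded once the majority settles on "17". Under the assumed definition, the flip rate also falls toward zero as the majority stops changing, which is consistent with the abstract's description of FR declining as the extinction window closes.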

1️⃣ One-Sentence Summary

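This paper finds that the gains reported for test-time reinforcement learning (TTRL), which uses majority vote as its training signal, are seriously misleading: most of the apparent progress comes from reinforcing problems the model could already solve, while problems pushed from correct to incorrect by the majority vote outnumber those genuinely learned, and that damage is irreversible. The authors propose the TTRL-Guard framework, which monitors the "Correct-Answer Extinction Window" and applies dynamic reward scaling, preservation of minority correct signals, and suspension of high-risk updates, improving accuracy on several mathematical reasoning benchmarks.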
