ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

📄 Abstract

Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels constructed by majority voting. To reduce overhead and improve exploration, prior work introduces tree-structured rollouts, which share reasoning prefixes and branch at key nodes to improve sampling efficiency. However, this paradigm still faces two challenges: (1) high-entropy branching can trigger rollout collapse, where the branching budget concentrates on a few trajectories with consecutive high-entropy segments, rapidly reducing the number of effective branches; (2) early pseudo-labels are noisy and biased, which can induce self-reinforcing overfitting, causing the policy to sharpen prematurely and suppress exploration. To address these issues, we propose Entropy-Confidence Hybrid Group Relative Policy Optimization (ECHO). During rollout, ECHO jointly leverages local entropy and group-level confidence to adaptively control branch width, and further introduces online confidence-based pruning to terminate persistently low-confidence branches, avoiding high-entropy traps and mitigating collapse. During policy updates, ECHO employs confidence-adaptive clipping and an entropy-confidence hybrid advantage shaping approach to enhance training robustness and mitigate early-stage bias. Experiments demonstrate that ECHO achieves consistent gains on multiple mathematical and visual reasoning benchmarks, and generalizes more effectively under a limited rollout budget.
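The two signals the abstract builds on, majority-vote pseudo-labels and entropy, can be illustrated with a minimal sketch. This is not the ECHO implementation: the function names (`majority_vote`, `pseudo_rewards`, `token_entropy`) are hypothetical, using the vote share as a "group-level confidence" is an assumption made only for illustration, and the abstract does not specify how ECHO combines these quantities for branching, pruning, or clipping.

```python
import math
from collections import Counter


def majority_vote(answers):
    """Return (pseudo_label, vote_share) for a group of final answers.

    The vote share is used here as a stand-in for group-level
    confidence; this is an assumption, not the paper's definition.
    """
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def pseudo_rewards(answers, label):
    """Binary reward: 1.0 if a rollout agrees with the pseudo-label, else 0.0."""
    return [1.0 if a == label else 0.0 for a in answers]


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy group of eight rollouts for one question (made-up answers).
answers = ["42", "41", "42", "42", "7", "42", "41", "42"]
label, confidence = majority_vote(answers)
rewards = pseudo_rewards(answers, label)

# A flat distribution has high entropy; a peaked one has low entropy.
flat = token_entropy([0.25, 0.25, 0.25, 0.25])
peaked = token_entropy([0.97, 0.01, 0.01, 0.01])
```

In this toy group the pseudo-label is the answer given by five of the eight rollouts, so the vote share is 0.625. A low vote share early in training is one way to see why the abstract describes early pseudo-labels as noisy: the reward signal is then built on a label that much of the group disagrees with.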

1️⃣ One-sentence summary

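This paper proposes ECHO, a new method that combines entropy and confidence to adaptively control branching and pruning in the rollout tree. It addresses the performance degradation in test-time reinforcement learning caused by inefficient exploration and noisy early pseudo-labels, and achieves better results on multiple reasoning tasks.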
