arXiv submission date: 2026-02-16
📄 Abstract - BFS-PO: Best-First Search for Large Reasoning Models

Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance on reasoning tasks by using long reasoning chains. However, this has also led to a significant increase in computational cost and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm that alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO searches for the shortest correct answer using a backtracking mechanism based on maximum-entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Across different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase LRM accuracy and shorten its answers.
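The abstract only sketches the search at a high level, so the following is a minimal, hypothetical Python sketch of the idea: expand partial reasoning chains shortest-first, treat high-entropy steps as backtracking points, and keep the shortest chain that verifies as correct. All names (`sample_step`, `is_correct`, the entropy threshold) are placeholder assumptions, not the paper's actual interface, and the model calls are mocked with random stubs.

```python
import heapq
import itertools
import random

random.seed(0)


def sample_step(chain):
    """Placeholder for one decoding step of the LRM.

    A real implementation would sample the next reasoning step from the
    model and compute the entropy of its output distribution.
    """
    step = f"step{len(chain)}"
    entropy = random.random()                  # stand-in for model entropy
    done = len(chain) >= random.randint(3, 8)  # stand-in for end-of-answer
    return step, entropy, done


def is_correct(chain):
    """Placeholder verifier; in RL training this would be the reward check
    (e.g. exact match on the final answer)."""
    return random.random() < 0.3


def best_first_search(max_expansions=200, entropy_threshold=0.5):
    counter = itertools.count()  # tie-breaker so heapq never compares lists
    # Priority = chain length: shorter partial chains are expanded first,
    # biasing the search toward the shortest correct answer.
    frontier = [(0, next(counter), [])]
    best = None
    for _ in range(max_expansions):
        if not frontier:
            break
        length, _, chain = heapq.heappop(frontier)
        if best is not None and length + 1 >= len(best):
            continue  # this prefix can no longer beat the best answer found
        step, entropy, done = sample_step(chain)
        new_chain = chain + [step]
        if done:
            if is_correct(new_chain) and (best is None or len(new_chain) < len(best)):
                best = new_chain
        else:
            heapq.heappush(frontier, (len(new_chain), next(counter), new_chain))
        # Backtracking at high-entropy nodes: an uncertain decision point is
        # re-queued so an alternative continuation can be sampled from it.
        if entropy > entropy_threshold:
            heapq.heappush(frontier, (len(chain), next(counter), chain))
    return best


if __name__ == "__main__":
    print("shortest correct chain found:", best_first_search())
```

Presumably, in BFS-PO training the rollouts found this way would feed a GRPO/DAPO-style policy update; the sketch illustrates only the exploration side described in the abstract.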

Top-level tags: llm model training agents
Detailed tags: reasoning models reinforcement learning best-first search overthinking answer conciseness

BFS-PO: Best-First Search for Large Reasoning Models


1️⃣ One-Sentence Summary

This paper proposes a new algorithm called BFS-PO, which uses a best-first search strategy to train large reasoning models. It aims to address the verbose answers and high computational costs caused by overthinking, so that the model improves answer accuracy while generating more concise reasoning chains.

Source: arXiv: 2602.14917