arXiv submission date: 2026-02-09
📄 Abstract - Reinforcement Inference: Leveraging Uncertainty for Self-Correcting Language Model Reasoning

Modern large language models (LLMs) are often evaluated and deployed under a \emph{one-shot, greedy} inference protocol, especially in professional settings that require deterministic behavior. This regime can systematically underestimate a fixed model's true capability: many errors arise not from missing knowledge, but from premature commitment under internal ambiguity. We introduce \emph{Reinforcement Inference}, an entropy-aware inference-time control strategy that uses the model's own uncertainty to selectively invoke a second, more deliberate reasoning attempt, enabling stronger performance \emph{without any retraining}. On 12,032 MMLU-Pro questions across 14 subjects, using DeepSeek-v3.2 with deterministic decoding in a zero-shot setting, Reinforcement Inference improves accuracy from 60.72\% to 84.03\%, while incurring only 61.06\% additional inference calls. A 100\% re-asking ablation reaches 84.35\%, indicating that uncertainty-aware selection captures most of the attainable improvement with substantially less compute. Moreover, a \emph{prompt-only} ablation underperforms the baseline, suggesting that the gains are not explained by generic ``your output had high entropy, think step-by-step'' prompting alone. Beyond providing a practical inference-time upgrade, our results suggest a broader \emph{entropy-aware} paradigm for measuring and expanding model capability: because modern decoder-based models generate outputs autoregressively, entropy and related confidence measures arise naturally as first-class control signals during generation. The resulting gap between one-pass greedy inference and uncertainty-conditioned deliberation offers a diagnostic lens on an LLM's latent reasoning horizon and motivates future training objectives that explicitly constrain correctness--confidence alignment.
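The control loop the abstract describes can be sketched in a few lines: score the first greedy answer by its mean per-token entropy, and re-ask with a more deliberate prompt only when that score crosses a threshold. The sketch below is illustrative, not the paper's implementation; `ask`, the `deliberate` flag, and the threshold value are all assumptions standing in for a real model call that returns per-token probability distributions (e.g. from logprobs).

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(dists):
    """Average per-token entropy over a generated answer."""
    return sum(token_entropy(d) for d in dists) / len(dists)

def reinforcement_inference(ask, threshold=0.5):
    """Entropy-gated two-pass inference (hypothetical interface).

    `ask(deliberate=...)` stands in for a model call returning
    (answer, per_token_distributions); the threshold is illustrative.
    """
    answer, dists = ask(deliberate=False)   # first pass: one-shot greedy
    if mean_entropy(dists) > threshold:     # uncertain -> spend a second call
        answer, _ = ask(deliberate=True)    # re-ask with deliberate reasoning
    return answer
```

Because the gate fires only on high-entropy answers, confident first passes cost a single call, which is how the paper's selective variant stays well below the 100% re-asking budget.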

Top-level tags: llm model evaluation natural language processing
Detailed tags: inference-time control uncertainty reasoning self-correction confidence calibration

Reinforcement Inference: Leveraging Uncertainty for Self-Correcting Language Model Reasoning


1️⃣ One-sentence summary

This paper proposes a method called "Reinforcement Inference" that lets a large language model gauge how uncertain it is about an answer and, when that uncertainty is high, decide to think the question through a second time, significantly improving accuracy without any retraining of the model.

Source: arXiv:2602.08520