arXiv submission date: 2026-02-22
📄 Abstract - Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training

Multi-turn LLM agents are becoming pivotal to production systems, spanning customer service automation, e-commerce assistance, and interactive task management, where accurately distinguishing high-value informative signals from stochastic noise is critical for sample-efficient training. In real-world scenarios, a failure in a trivial task may reflect random instability, whereas success in a high-difficulty task signifies a genuine capability breakthrough. Yet, existing group-based policy optimization methods rigidly rely on statistical deviation within discrete batches, frequently misallocating credit when task difficulty fluctuates. To address this issue, we propose Proximity-based Multi-turn Optimization (ProxMO), a practical and robust framework engineered specifically for the constraints of real-world deployment. ProxMO integrates global context via two lightweight mechanisms: success-rate-aware modulation dynamically adapts gradient intensity based on episode-level difficulty, while proximity-based soft aggregation derives baselines through continuous semantic weighting at the step level. Extensive evaluations on ALFWorld and WebShop benchmarks demonstrate that ProxMO yields substantial performance gains over existing baselines with negligible computational cost. Ablation studies further validate the independent and synergistic efficacy of both mechanisms. Crucially, ProxMO offers plug-and-play compatibility with standard GRPO frameworks, facilitating immediate, low-friction adoption in existing industrial training pipelines. Our implementation is available at: this https URL.
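The two mechanisms named in the abstract can be sketched roughly as follows. This is a minimal illustration assuming one plausible reading of the abstract, not the paper's actual implementation: the function name `proxmo_advantages`, the difficulty proxy, and the exact modulation and weighting formulas are all assumptions.

```python
import numpy as np

def proxmo_advantages(rewards, success_rates, step_embeddings, tau=1.0):
    """Hypothetical sketch of ProxMO-style advantage shaping.

    rewards:         (G,) episode returns for a group of G rollouts
    success_rates:   (G,) estimated per-task success rates (difficulty proxy)
    step_embeddings: (G, T, D) per-step semantic embeddings
    """
    rewards = np.asarray(rewards, dtype=float)

    # Success-rate-aware modulation (assumed form): up-weight successes on
    # hard tasks (low success rate) and down-weight failures on easy ones.
    difficulty = 1.0 - np.asarray(success_rates, dtype=float)
    modulation = np.where(rewards > rewards.mean(), difficulty, 1.0 - difficulty)

    # Proximity-based soft aggregation (assumed form): at each step, the
    # baseline is a softmax-weighted average of the OTHER rollouts' rewards,
    # weighted by semantic similarity of the corresponding step embeddings.
    G, T, D = step_embeddings.shape
    advantages = np.zeros((G, T))
    for t in range(T):
        e = step_embeddings[:, t, :]
        e = e / (np.linalg.norm(e, axis=1, keepdims=True) + 1e-8)
        sim = (e @ e.T) / tau                   # (G, G) cosine similarities
        np.fill_diagonal(sim, -np.inf)          # exclude self from the baseline
        w = np.exp(sim - sim.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        baseline = w @ rewards                  # continuous, proximity-weighted
        advantages[:, t] = modulation * (rewards - baseline)
    return advantages
```

Contrast with vanilla GRPO, which would use one hard group-mean baseline per batch; here the baseline varies continuously per step with semantic proximity, and the difficulty term rescales the resulting gradient signal.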

Top-level tags: llm agents, model training
Detailed tags: credit assignment, multi-turn optimization, policy gradient, agent training, proximity weighting

Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training


1️⃣ One-sentence summary

This paper proposes a new method called ProxMO, which dynamically assesses task difficulty and the semantic relatedness between steps to more precisely identify and reward the key successful steps of LLM agents in complex multi-turn interactions, achieving more efficient performance gains from fewer training samples while integrating easily into existing industrial training pipelines.

Source: arXiv 2602.19225