arXiv submission date: 2025-12-15
📄 Abstract - QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool use, and extended dialogue.
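
The synthesis pipeline described in (1) — deconstruct documents into atomic facts and their relations, then programmatically compose verifiable multi-hop questions — could look roughly like the sketch below. This is only a minimal illustration, not the paper's implementation; all names (`AtomicFact`, `link_facts`, `compose_question`) are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AtomicFact:
    """A single verifiable statement tied to one evidence span in a document."""
    doc_id: str
    span: Tuple[int, int]      # character offsets of the supporting evidence
    subject: str
    relation: str
    obj: str


def link_facts(facts: List[AtomicFact]) -> List[Tuple[AtomicFact, AtomicFact]]:
    """Chain facts that share an entity across different documents, so a
    question must be grounded in two globally distributed evidence spans."""
    pairs = []
    for i, a in enumerate(facts):
        for b in facts[i + 1:]:
            if a.obj == b.subject and a.doc_id != b.doc_id:
                pairs.append((a, b))
    return pairs


def compose_question(pair: Tuple[AtomicFact, AtomicFact]) -> dict:
    """Programmatically compose a two-hop question whose answer can be
    checked against the second fact's object (a verifiable reward signal)."""
    a, b = pair
    return {
        "question": f"What is the {b.relation} of the {a.relation} of {a.subject}?",
        "answer": b.obj,
        "evidence_spans": [a.span, b.span],
    }


def synthesize(facts: List[AtomicFact], n: int = 1000) -> List[dict]:
    """Toy end-to-end pipeline: link atomic facts, then compose questions."""
    pairs = link_facts(facts)
    random.shuffle(pairs)
    return [compose_question(p) for p in pairs[:n]]
```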

Top-level tags: llm model training systems
Detailed tags: long-context reasoning post-training reinforcement learning memory management data synthesis

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management


1️⃣ One-Sentence Summary

This paper introduces QwenLong-L1.5, which uses a systematic post-training recipe combining high-quality data synthesis, stabilized reinforcement learning, and a memory-augmented architecture for ultra-long contexts to substantially improve long-context reasoning, reaching performance comparable to top-tier models on multiple benchmarks.


2️⃣ Key Innovations

1. Long-Context Data Synthesis Pipeline

2. Stabilized Reinforcement Learning for Long-Context Training (see the RL sketch after this list)

3. Memory-Augmented Architecture for Ultra-Long Contexts

4. Token-Level Policy Gradient Loss with KL Regularization Removed

5. Memory Agent with a Planning Mechanism
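
The RL-side ideas in points 2 and 4 above — task-balanced sampling, task-specific (per-task-normalized) advantage estimation, and adaptive entropy control in the spirit of AEPO — might be sketched as follows. This is an assumption about what these components could look like; the paper's actual sampling scheme, advantage estimator, and update rule may differ.

```python
import random
from collections import defaultdict

import numpy as np


def task_balanced_batch(prompts_by_task, per_task, rng=random):
    """Draw the same number of prompts from every task so that no single
    task dominates a training batch (task-balanced sampling)."""
    batch = []
    for task, prompts in prompts_by_task.items():
        k = min(per_task, len(prompts))
        batch.extend((task, p) for p in rng.sample(prompts, k))
    rng.shuffle(batch)
    return batch


def task_specific_advantages(rewards, task_ids):
    """Normalize rewards within each task group, so tasks whose verifiers
    hand out systematically higher rewards do not bias the policy gradient."""
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = np.zeros_like(rewards)
    groups = defaultdict(list)
    for i, t in enumerate(task_ids):
        groups[t].append(i)
    for idx in groups.values():
        r = rewards[idx]
        adv[idx] = (r - r.mean()) / (r.std() + 1e-6)
    return adv


class AdaptiveEntropyController:
    """Entropy-bonus coefficient that tracks a target policy entropy:
    it grows when entropy collapses (more exploration) and shrinks when
    entropy is already high (more exploitation)."""

    def __init__(self, target_entropy, coef=1e-3, lr=1e-4):
        self.target = target_entropy
        self.coef = coef
        self.lr = lr

    def update(self, measured_entropy):
        # Move the coefficient toward restoring the target entropy level.
        self.coef = max(0.0, self.coef + self.lr * (self.target - measured_entropy))
        return self.coef
```

In this sketch the entropy coefficient rises when measured policy entropy falls below the target and decays otherwise, which is one plausible reading of "adaptive entropy-controlled" optimization.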


3️⃣ Main Results and Value

Result Highlights

- Built on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average.
- On ultra-long tasks (1M~4M tokens), the memory-agent framework yields a 9.48-point gain over the agent baseline.

Practical Value

- The acquired long-context reasoning ability transfers to general domains such as scientific reasoning, memory tool use, and extended dialogue.
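
For the ultra-long (1M~4M token) setting above, the memory-agent idea can be illustrated with a simple iterative read-and-compress loop. This is only a sketch under the assumption that the agent maintains a compressed textual memory per chunk; the paper's agent additionally uses a planning mechanism and multi-stage fusion RL training, which are not shown.

```python
def memory_agent_answer(llm, chunks, question, max_memory_tokens=8000):
    """Iterative memory-based processing for inputs that exceed the context
    window: read the document chunk by chunk, keep a compressed textual
    memory of question-relevant evidence, then reason over that memory.

    `llm` is a stand-in callable (prompt str -> completion str)."""
    memory = ""
    for chunk in chunks:
        memory = llm(
            f"Question: {question}\n"
            f"Current memory:\n{memory}\n"
            f"New document chunk:\n{chunk}\n"
            f"Update the memory with evidence relevant to the question, "
            f"keeping it under {max_memory_tokens} tokens."
        )
    return llm(f"Question: {question}\nEvidence memory:\n{memory}\nAnswer:")
```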


4️⃣ Glossary

- AEPO (Adaptive Entropy-Controlled Policy Optimization): the proposed RL algorithm that dynamically regulates the exploration-exploitation trade-off during long-context training.
- Task-balanced sampling / task-specific advantage estimation: the sampling and advantage-normalization scheme introduced to mitigate reward bias across heterogeneous long-context tasks.
- Memory-augmented architecture: a memory management framework, trained with multi-stage fusion RL, that integrates single-pass reasoning with iterative memory-based processing for inputs exceeding 4M tokens.
- Memory agent: the iterative agent with a planning mechanism that handles ultra-long inputs (1M~4M tokens) through the memory framework.

Source: arXiv:2512.12967