← 返回列表

arXiv 提交日期: 2026-05-14

📄 Abstract - Dynamic Latent Routing

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.

顶级标签: llm reinforcement learning model training

动态潜在路由 / Dynamic Latent Routing

1️⃣ 一句话总结

本文提出了一种名为动态潜在路由（DLR）的方法，通过在训练过程中动态搜索和组合最优的子策略，使语言模型在处理时间变化任务时比传统微调方法平均提升6.6个百分点，并且能够学习到结构化的推理路径。

👋 没兴趣 ☆ 感兴趣 📌 待读

打开原文 PDF

源自 arXiv: 2605.14323

← 返回列表

菜单

AI 帮我研读全文

1️⃣ 一句话总结

密码管理

设置密码

修改密码

移除密码

菜单

AI 帮我研读全文

1️⃣ 一句话总结

获取最新论文摘要