arXiv submission date: 2026-05-14
📄 Abstract - Dynamic Latent Routing

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDPs) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.
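
As a rough illustration of the "search, select, update" loop the abstract refers to, the sketch below runs the textbook Dijkstra algorithm on a small, made-up graph. It is not the paper's General Dijkstra Search: it assumes static, non-negative edge costs rather than time-varying rewards, and the graph, node names, and the helper `path_to` are hypothetical. It only shows the familiar property that an optimal path to the goal is assembled from optimal paths to intermediate nodes.

```python
import heapq


def dijkstra(graph, source):
    """Textbook Dijkstra on a dict-of-dicts graph with non-negative edge costs."""
    dist = {source: 0.0}        # best known cost to each node
    parent = {source: None}     # predecessor on the best known path
    frontier = [(0.0, source)]  # min-heap of (cost, node)
    settled = set()

    while frontier:
        # Select: take the cheapest node that has not been settled yet.
        cost, node = heapq.heappop(frontier)
        if node in settled:
            continue  # stale heap entry left over from an earlier update
        settled.add(node)

        # Search: look at every edge leaving the selected node.
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            # Update: keep the extension only if it improves on the best known cost.
            if new_cost < dist.get(nxt, float("inf")):
                dist[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(frontier, (new_cost, nxt))

    return dist, parent


def path_to(parent, goal):
    """Follow parent pointers back from goal to the source (hypothetical helper)."""
    path = []
    while goal is not None:
        path.append(goal)
        goal = parent[goal]
    return path[::-1]


# Hypothetical toy graph: "s" is the start node, "g" is the goal.
graph = {
    "s": {"a": 1.0, "b": 4.0},
    "a": {"b": 2.0, "g": 6.0},
    "b": {"g": 1.0},
    "g": {},
}

dist, parent = dijkstra(graph, "s")
print(path_to(parent, "g"), dist["g"])
print(path_to(parent, "b"), dist["b"])
```

In this toy graph the best route to `g` passes through `b`, and its prefix up to `b` is itself the best route to `b`. That is the static-cost analogue of composing a goal-reaching policy from optimal sub-policies; how the paper extends this to time-varying rewards is not shown here.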

Top-level tags: llm, reinforcement learning, model training
Detailed tags: policy composition, latent routing, post-training method, supervised fine-tuning, discrete latent codes

Dynamic Latent Routing


1️⃣ One-sentence summary

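This paper proposes Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage; in low-data fine-tuning settings it matches or outperforms supervised fine-tuning, with a mean gain of 6.6 percentage points, and it learns structured routing behaviors.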

Source: arXiv: 2605.14323