arXiv submission date: 2026-04-23
📄 Abstract - Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

Sub-token routing offers a finer control axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. The motivation is that a relevant token need not be internally uniform: under a retention budget, preserved value groups are distributed unevenly both across tokens and within tokens, which suggests that KV compression need not be an all-or-nothing decision at the token level. We study this fine-grained routing mechanism in two settings. For compression-aware language modeling, we introduce a query-independent design that combines routed subspace LoRA with value-group routing on the KV path. For downstream-task-preserving KV compression, we introduce a query-aware design in which a predictor-based selector allocates a global retention budget over context-token/value-group pairs using query-conditioned relevance. Experiments show that the query-independent design improves the quality-compression tradeoff for language modeling, while the query-aware design preserves downstream behavior under reduced KV budgets. We further examine the relation between token-level and sub-token-level query-aware routing, and show that they form complementary compression axes: token-level methods determine which tokens survive globally, while sub-token routing determines how the surviving tokens are compressed internally.
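To make the query-aware design concrete, here is a minimal sketch of allocating a global retention budget over (context-token, value-group) pairs. All names, shapes, and the dot-product "predictor" are illustrative assumptions standing in for the paper's learned, query-conditioned selector; they are not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

T, G, d = 8, 4, 16   # context tokens, value groups per token, group dim
budget = 12          # global retention budget in (token, group) units

values = rng.normal(size=(T, G, d))   # grouped value vectors (toy data)
query = rng.normal(size=(d,))         # query representation (toy data)

# Stand-in predictor: query-conditioned relevance per (token, group) pair.
# The paper uses a learned predictor; a dot product is only a placeholder.
scores = values @ query               # shape (T, G)

# Allocate the budget jointly over all token/group pairs,
# rather than keeping or dropping whole tokens.
flat = scores.ravel()
keep = np.argsort(flat)[-budget:]     # indices of retained pairs
mask = np.zeros(T * G, dtype=bool)
mask[keep] = True
mask = mask.reshape(T, G)

# A token can be partially retained: some groups kept, others dropped,
# so retention is uneven both across and within tokens.
per_token_kept = mask.sum(axis=1)
```

The key property this illustrates is that `mask` spends exactly `budget` units globally while `per_token_kept` can differ per token, which is the sub-token (all-or-nothing-free) behavior the abstract describes.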

Top-level tags: natural language processing, model training, systems
Detailed tags: lora, kv compression, sub-token routing, query-aware, transformer efficiency

Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression


1️⃣ One-sentence summary

The paper proposes a routing method for LoRA-adapted Transformer models that refines attention key-value (KV) compression from the traditional whole-token level down to sub-structures within each token; through two designs, one query-independent and one query-aware, it compresses context information more efficiently while preserving model quality.
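The query-independent side combines value-group routing with routed subspace LoRA, i.e. the hidden dimension is split into groups and each group is sent through one of several low-rank adapters. The sketch below is a hedged illustration of that idea under assumed shapes and hard top-1 routing; the expert count, rank, and router are hypothetical, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

d, G, r, E = 32, 4, 2, 3   # hidden dim, subspace groups, LoRA rank, experts
gd = d // G                # per-group width

x = rng.normal(size=(d,))
W = rng.normal(size=(d, d)) * 0.05   # frozen base weight (toy)

# One low-rank (A, B) adapter pair per routed subspace (hypothetical params).
A = rng.normal(size=(E, gd, r)) * 0.1
B = np.zeros((E, r, gd))             # standard LoRA zero-init for B

router = rng.normal(size=(G, E))     # per-group routing logits (toy)
choice = router.argmax(axis=1)       # hard top-1 routing per group

# Each sub-token group gets its own low-rank update from its routed expert.
delta = np.zeros_like(x)
for g in range(G):
    seg = x[g * gd:(g + 1) * gd]
    e = choice[g]
    delta[g * gd:(g + 1) * gd] = seg @ A[e] @ B[e]

y = x @ W + delta
```

With the usual zero initialization of `B`, the routed adapters start as a no-op (`delta == 0`), so the adapted model initially matches the frozen base, which is the standard LoRA training setup.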

Source: arXiv: 2604.21335