📄 Abstract - Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
Agentic LLM workloads place bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe TTFT spikes of 10-16 s on unchanged content. Prior position-independent caching systems correct RoPE on the full $d_K$-dimensional key, an architectural cost imposed by GQA, not by caching itself. Multi-Head Latent Attention (MLA), deployed at scale in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3, factors each KV row into a position-free latent $c_{KV}$ and a 64-dim decoupled key $k_r$ that is correctable in closed form; this structure makes content-addressed caching a natural fit rather than a GQA workaround. We present Irminsul, which extends SGLang's radix cache with content-hash keying over CDC-chunked segments and a $\delta$-rotation rule for $k_r$. We evaluate three native MLA-MoE deployments - DeepSeek-V2-Lite (16B/2.4B), Kimi Moonlight-16B-A3B, and JoyAI-Flash (48B/3B) - verifying output consistency on all three and measuring recovery on the two endpoints; Irminsul recovers up to ~83% of prompt tokens beyond exact-prefix matching on agentic traffic while delivering 63% prefill energy savings per cache hit. We argue that content-addressed caching belongs in the serving stack as a first-class primitive, not a retrofit over prefix matching.
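To make the "correctable in closed form" claim concrete, here is a minimal NumPy sketch of what a $\delta$-rotation over the 64-dim decoupled key could look like. The function name, the NeoX-style half-split pairing, and the RoPE base of 10000 are our assumptions for illustration, not Irminsul's actual implementation.

```python
import numpy as np

def delta_rotate_kr(k_r: np.ndarray, delta: int, base: float = 10000.0) -> np.ndarray:
    """Correct a cached RoPE'd decoupled key for a position shift of `delta`.

    RoPE encodes position p as a block-diagonal rotation R(p) over 2-D pairs
    of the key, and rotations compose: R(p + delta) = R(delta) @ R(p). So a
    k_r cached at position p can be moved to p + delta with one extra
    rotation, while the position-free latent c_KV needs no correction at all.
    """
    d = k_r.shape[-1]                  # 64 for MLA's decoupled RoPE key
    half = d // 2
    # Per-pair frequencies theta_i = base^(-2i/d), as in standard RoPE.
    inv_freq = base ** (-np.arange(half, dtype=np.float64) / half)
    angle = delta * inv_freq
    cos, sin = np.cos(angle), np.sin(angle)
    # Assumed NeoX-style half-split pairing: (x_i, x_{i+d/2}) rotate together.
    x1, x2 = k_r[..., :half], k_r[..., half:]
    return np.concatenate((x1 * cos - x2 * sin, x1 * sin + x2 * cos), axis=-1)

# Example: a chunk cached at position 128 is reused at position 512.
k_cached = np.random.randn(64)
k_reused = delta_rotate_kr(k_cached, delta=512 - 128)
```

Because the correction touches only the 64-dim $k_r$ per cached token, it is far cheaper than re-rotating a full $d_K$-dimensional GQA key, which is the asymmetry the abstract leans on.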
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
1️⃣ One-Sentence Summary
To address the failure of traditional prefix caches under position shifts in agentic LLM applications, this paper proposes Irminsul, a content-addressed caching system that exploits the key-value separation of the Multi-Head Latent Attention (MLA) architecture to replace full-dimensional position correction with a closed-form rotation fix, recovering up to ~83% of prompt tokens beyond exact-prefix matching across several mainstream MLA models and saving 63% of prefill energy per cache hit, and argues that content-addressed caching should be a first-class citizen of inference serving.
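To illustrate the content-addressing side, a minimal sketch of content-defined chunking (CDC) with content-hash keys follows. The rolling-hash constants, chunk-size parameters, and function names are illustrative placeholders, not the paper's actual algorithm.

```python
import hashlib
import struct

def cdc_chunks(token_ids: list[int], mask: int = 0x3F,
               min_len: int = 16, max_len: int = 256) -> list[list[int]]:
    """Split a token stream at content-defined boundaries.

    A boundary is declared wherever the low bits of a rolling hash match
    `mask` (expected chunk length ~ mask + 1 tokens). Because boundaries
    depend only on local content, inserting a turn early in the prompt
    re-chunks only its neighborhood instead of shifting every later chunk.
    """
    chunks, start, h = [], 0, 0
    for i, tok in enumerate(token_ids):
        h = (h * 1103515245 + tok + 12345) & 0xFFFFFFFF  # toy rolling hash
        length = i - start + 1
        if (length >= min_len and (h & mask) == mask) or length >= max_len:
            chunks.append(token_ids[start:i + 1])
            start, h = i + 1, 0
    if start < len(token_ids):
        chunks.append(token_ids[start:])
    return chunks

def chunk_key(chunk: list[int]) -> str:
    """Content hash of a chunk: identical tokens map to the same cache key
    no matter where they land in the prompt, unlike a prefix hash."""
    return hashlib.sha256(struct.pack(f"<{len(chunk)}I", *chunk)).hexdigest()
```

Pairing such content keys with the $\delta$-rotation above is what lets a radix-cache-style lookup survive position shifts: the key finds the cached KV segment, and the rotation re-bases its $k_r$ to the new offset.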