MiniPIC: Flexible Position-Independent Caching in <100LOC
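1️⃣ One-Sentence Summary
By changing fewer than 100 lines of core engine code and introducing three user-controlled caching primitives, MiniPIC lets LLM inference engines such as vLLM efficiently reuse recurring text spans (such as documents or code) wherever they appear in a prompt, which significantly improves prefill throughput for retrieval-augmented and agentic workloads and substantially reduces time-to-first-token.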
Retrieval-augmented and agentic workloads repeatedly prefill recurring, predictable, structured inputs (which we call "spans") such as documents and code files. Yet prefix caching in engines such as vLLM cannot reuse a span's KV entries unless the span follows a prefix identical to that of an earlier request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible, and fast vLLM design built from two ingredients: a positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing, token-level primitives that modify hashing behavior and the effective block-level causal attention structure: block-aligned padding, span separator (SSep), and prompt depend (PDep). With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
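The positional-encoding-free cache described above can be illustrated with a small NumPy sketch. This is not MiniPIC's attention backend: the function names (`rope`, `scores_from_unrotated_cache`, `scores_at`), the rotate-half RoPE layout, and the toy dimensions are assumptions chosen for illustration, and the sketch covers only the positional part of the problem (it treats the span's K vectors as given and ignores how deeper layers depend on context). It shows that K vectors cached without rotation can be rotated at attention time with whatever logical positions the current request assigns, whereas K vectors cached with RoPE already applied go stale when the span moves.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate-half RoPE for x of shape (n_tokens, head_dim), head_dim even."""
    half = x.shape[1] // 2
    inv_freq = base ** (-np.arange(half) / half)               # (half,)
    pos = np.asarray(positions, dtype=np.float64)
    angles = pos[:, None] * inv_freq[None, :]                  # (n_tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def scores_from_unrotated_cache(q, q_pos, k_unrotated, k_pos):
    """Apply RoPE to cached, unrotated K at attention time (hypothetical helper)."""
    return rope(q, q_pos) @ rope(k_unrotated, k_pos).T

rng = np.random.default_rng(0)
head_dim, span_len = 8, 4
k_span = rng.standard_normal((span_len, head_dim))  # cached once, without RoPE
q = rng.standard_normal((1, head_dim))              # query token right after the span

def scores_at(offset):
    """Attention logits when the cached span is placed at `offset` in the prompt."""
    k_pos = offset + np.arange(span_len)
    q_pos = np.array([offset + span_len])
    return scores_from_unrotated_cache(q, q_pos, k_span, k_pos)

# Reuse the same cached span at two different prompt offsets.
at_start = scores_at(0)
at_offset = scores_at(37)

# A conventional cache bakes in the positions seen at cache time (here 0..3).
# Reusing it unchanged at offset 37 pairs a query at position 41 with keys
# that still carry positions 0..3.
k_baked = rope(k_span, np.arange(span_len))
stale = rope(q, np.array([37 + span_len])) @ k_baked.T

print(np.allclose(at_start, at_offset))  # logits depend only on relative positions
print(np.allclose(stale, at_offset))     # baked-in positions give different logits
```

Because RoPE logits depend only on the difference between query and key positions, rotating the unrotated cache with the new logical positions reproduces the same logits at either offset, while the pre-rotated cache does not.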
Source: arXiv: 2606.13126