arXiv submission date: 2026-06-11
📄 Abstract - MiniPIC: Flexible Position-Independent Caching in <100LOC

Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible and fast vLLM design built from two ingredients: positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing and token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep), that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
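
To make the "positional-encoding-free KV cache" idea concrete, here is a toy sketch in plain Python. It is not MiniPIC's attention backend: the `kv_cache` dictionary, the `attention_scores` helper, and the span name `doc_a` are hypothetical, and only the standard RoPE pairwise rotation is assumed. The point is that if K is stored unrotated and rotated at read time with the request's logical positions, the same cached span yields the same query-key scores wherever it is placed.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply RoPE: rotate each consecutive pair of `vec` by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # i = 2k, so the exponent is -2k/d
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical cache: span id -> UNROTATED key vectors (one per token).
kv_cache = {"doc_a": [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.2, 0.7]]}

def attention_scores(q, q_pos, span_id, span_start):
    """Rotate cached keys at read time using this request's logical positions."""
    q_rot = rope(q, q_pos)
    keys = kv_cache[span_id]
    return [dot(q_rot, rope(k, span_start + j)) for j, k in enumerate(keys)]

q = [0.3, -0.2, 0.8, 0.1]
# Same span, placed at offset 0 in one request and offset 100 in another;
# the query sits 5 tokens after the span start in both cases.
scores_at_0 = attention_scores(q, 5, "doc_a", 0)
scores_at_100 = attention_scores(q, 105, "doc_a", 100)
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(scores_at_0, scores_at_100))
```

Because RoPE scores depend only on the relative offset between query and key, `same` holds here up to floating-point error. A cache that stored already-rotated keys would instead tie each entry to the absolute position at which it was first computed.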

Top-level tags: systems, model training
Detailed tags: kv cache, prefix caching, inference server, attention mechanism, vllm

MiniPIC: Flexible Position-Independent Caching in <100LOC


1️⃣ One-sentence summary

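By changing fewer than 100 lines of core engine code (plus a custom attention backend) and introducing three user-controlled cache primitives, MiniPIC lets LLM inference engines such as vLLM efficiently reuse recurring text spans (such as documents or code files) wherever they appear in a prompt, which substantially improves prefill throughput for retrieval-augmented and agentic workloads and sharply reduces time-to-first-token.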
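
The effect of a span separator on cache hashing can be illustrated with a small sketch. It assumes a prefix cache that hashes each fixed-size block together with the hash of the block before it, so that a block only matches when its whole prefix matches. The `SSEP` sentinel, the block size, and `block_hashes` are hypothetical names for illustration, not MiniPIC's or vLLM's actual API. The sketch also assumes the span has already been padded so that it starts on a block boundary.

```python
import hashlib

BLOCK = 4   # toy block size (tokens per KV block)
SSEP = -1   # hypothetical sentinel token marking the start of a reusable span

def block_hashes(tokens, reset_on_ssep=True):
    """Chain-hash full blocks; optionally restart the chain at a span separator."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        block = tokens[i:i + BLOCK]
        if reset_on_ssep and block[0] == SSEP:
            parent = b""  # span hash no longer depends on what came before
        h = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(h)
        parent = h
    return hashes

doc = [SSEP, 7, 8, 9, 10, 11, 12, 13]          # two blocks, block-aligned
req1 = [1, 2, 3, 4] + doc                      # one block of unrelated prefix
req2 = [5, 6, 7, 8, 9, 9, 9, 9] + doc          # a different, longer prefix

# With the separator, the span's block hashes match across requests.
reusable = block_hashes(req1)[-2:] == block_hashes(req2)[-2:]
# With plain prefix hashing, the differing prefixes prevent any match.
prefix_only = block_hashes(req1, False)[-2:] == block_hashes(req2, False)[-2:]
```

Here `reusable` is true and `prefix_only` is false: resetting the hash chain at the separator is what lets the same document hit the cache behind two different prefixes.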

Source: arXiv: 2606.13126