PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

Abstract

Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation, and mature kernels such as FlashInfer provide highly optimized native-paged decode attention. However, the best single-kernel implementation is not always the best serving schedule: long-context decode with few active sequences can under-utilize commodity GPUs, while mixed sequence lengths introduce a tension between many exact-length launches and coarse padded batches. We present PersistentKV, a native block-table decode attention engine and page-aware scheduling study for grouped-query attention (GQA). PersistentKV maps work by KV-head group, is designed to reuse K and V tiles across grouped query heads, supports native page tables, and adds a compact workqueue schedule that executes only non-empty row-KV-head-sequence-split tasks. On an RTX 3060 with FP16, page size 16, Hq=32, Hkv=8, d=128, and identical correctness tolerance against FlashInfer, a calibrated adaptive policy selects FlashInfer for small active batches, PersistentKV sequence splitting for B1 long-context steps, and PersistentKV workqueue scheduling for B8 long-context steps. With thresholds and split counts fixed on calibration traces, the policy improves synchronized wall throughput on one held-out trace seed by 1.063-1.265x on B8 bimodal, uniform, and Zipf-like workloads and by 1.399x on a B1 bucketed trace. On the B4 bimodal boundary case, the policy avoids the PersistentKV regression by selecting FlashInfer. These results identify a concrete systems niche for adaptive page-aware decode scheduling and show that work assignment, not only attention math, is a decisive serving-system variable.
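
The compact workqueue mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names Task and build_workqueue, the even page-wise split rule, and the single num_splits argument shared by all rows are assumptions made for illustration only. The sketch enumerates (row, KV-head, sequence-split) tasks on the host and drops any split that would cover no pages, which is how short rows in a mixed-length batch could avoid contributing empty work.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Task:
    """One unit of decode-attention work (hypothetical layout)."""
    row: int         # index of the sequence in the active batch
    kv_head: int     # KV head; its grouped query heads would share the loaded K/V tiles
    split: int       # sequence-split index within the row
    page_start: int  # first logical page covered (inclusive)
    page_end: int    # last logical page covered (exclusive)


def build_workqueue(
    seq_lens: Sequence[int],
    num_kv_heads: int,
    num_splits: int,
    page_size: int = 16,  # page size quoted in the abstract
) -> List[Task]:
    """Enumerate only non-empty (row, KV-head, sequence-split) tasks.

    Assumption: each row's pages are divided evenly across num_splits;
    splits that fall past the end of a short row are skipped.
    """
    if num_kv_heads < 1 or num_splits < 1 or page_size < 1:
        raise ValueError("num_kv_heads, num_splits and page_size must be >= 1")

    tasks: List[Task] = []
    for row, seq_len in enumerate(seq_lens):
        num_pages = -(-seq_len // page_size)  # ceiling division
        if num_pages == 0:
            continue  # empty row: nothing to schedule
        pages_per_split = -(-num_pages // num_splits)
        for kv_head in range(num_kv_heads):
            for split in range(num_splits):
                start = split * pages_per_split
                end = min(start + pages_per_split, num_pages)
                if start >= end:
                    continue  # this split covers no pages for this row
                tasks.append(Task(row, kv_head, split, start, end))
    return tasks
```

Under these assumptions, a two-row batch with lengths of 16 and 4096 tokens, 8 KV heads, and 4 splits yields 40 tasks, compared with the 64 slots of a dense row-by-head-by-split grid.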

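One-Sentence Summary

This paper proposes PersistentKV, a page-aware decode scheduling engine that groups attention computation by KV head, reuses cached blocks, and schedules tasks adaptively. It targets the low GPU utilization caused by KV-cache movement in long-context LLM inference and improves throughput by 6% to 40% over the best existing approach across different workloads.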
