QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

📄 Abstract

Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.
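The selection step at the heart of cache fusion can be illustrated with a toy sketch. The code below is not QCFuse's selector: it is a minimal, single-layer, single-head illustration of generic query-aware selection, assuming tokens are ranked by scaled dot-product attention from one query vector to cached keys and that a fixed recomputation budget is given. The function names (`attention_scores`, `select_recompute_tokens`), the `budget` parameter, and the toy vectors are all hypothetical.

```python
import math


def attention_scores(query, keys):
    """Softmax-normalized scaled dot-product scores of one query vector against cached keys."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_recompute_tokens(query, cached_keys, budget):
    """Return, in position order, the indices of the `budget` highest-scoring context tokens."""
    scores = attention_scores(query, cached_keys)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy cached keys for four context tokens (hypothetical 2-dimensional vectors).
cached_keys = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [-1.0, 0.0],
]
query = [1.0, 0.0]

# Tokens chosen for recomputation under the current query; the rest keep their cached KV.
selected = select_recompute_tokens(query, cached_keys, budget=2)
reused = [i for i in range(len(cached_keys)) if i not in selected]
print(selected, reused)  # prints [0, 2] [1, 3]
```

In this toy setting, tokens 0 and 2 align most closely with the query and are marked for recomputation, while tokens 1 and 3 reuse their precomputed KV entries. A real system would have to make this decision per layer and per head, which is where the cost trade-off described in the abstract arises.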


1️⃣ One-sentence summary

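This paper proposes QCFuse, an efficient method that uses a compressed view so that, when reusing precomputed caches, the system can quickly identify which retrieved content is most relevant to the current user query, substantially reducing redundant computation and speeding up AI assistant responses.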
