arXiv submission date: 2026-06-22
📄 Abstract - RLM-Cascade: Response-Level Speculative Decoding for Cost-Efficient LLM API Serving

We present RLM-Cascade, a proxy-layer system that applies speculative decoding at the response level to reduce LLM API costs without requiring model architecture access or a shared vocabulary. A fast, inexpensive draft model generates a candidate response; a capable verify model accepts, enhances, or is bypassed entirely depending on a lightweight complexity router. On a real-world agentic coding workload (Claude Code), RLM-Cascade achieves a draft-use rate of 88.8% across 125 production requests, reducing API cost by 45.8% relative to a direct Opus baseline. Counter-intuitively, the proxy also reduces end-to-end latency: median response time is 2,026 ms versus 3,698 ms for Native Opus -- a 1.83X speedup at p50 -- because the SKIPPED path (DeepSeek only, no Opus call) dominates the workload distribution. Quality matches or exceeds the Opus baseline: 100% pass rate on a 20-task Code/Math/Instruct benchmark versus 95% for Native Opus. We further describe a rule-based complexity router that selects the SKIPPED path for simple agentic turns and a hybrid tool-call strategy that bypasses the speculative pipeline for schema-critical tool-selection turns. RLM-Cascade is deployed in production as an enterprise AI infrastructure component and published as open source with a live metrics dashboard and Prometheus endpoint.
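The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `is_simple` heuristic, the review prompt wording, the `needs_tool_schema` flag, and the path labels other than SKIPPED are all hypothetical, and sending tool-selection turns straight to the verify model is an assumption about what "bypasses the speculative pipeline" means.

```python
from dataclasses import dataclass
from typing import Callable

# A "model" here is any callable mapping a prompt to a response string.
Model = Callable[[str], str]


@dataclass
class CascadeResult:
    path: str  # "SKIPPED" (from the abstract); "ACCEPTED", "ENHANCED", "DIRECT" are assumed labels
    text: str


def is_simple(prompt: str) -> bool:
    """Hypothetical rule-based complexity check; the rules are illustrative only."""
    return len(prompt) < 200 and "refactor" not in prompt.lower()


def cascade(prompt: str, draft: Model, verify: Model,
            needs_tool_schema: bool = False) -> CascadeResult:
    # Assumed hybrid tool-call strategy: schema-critical turns skip the
    # speculative pipeline and go straight to the capable model.
    if needs_tool_schema:
        return CascadeResult("DIRECT", verify(prompt))

    # The cheap draft model always produces a candidate response first.
    candidate = draft(prompt)

    # SKIPPED path: simple turns return the draft with no verify-model call.
    if is_simple(prompt):
        return CascadeResult("SKIPPED", candidate)

    # Otherwise the verify model either accepts the draft or returns an improved answer.
    review = verify(
        "Review this draft. Reply ACCEPT or give a corrected answer.\n\n"
        f"Task: {prompt}\n\nDraft: {candidate}"
    )
    if review.strip() == "ACCEPT":
        return CascadeResult("ACCEPTED", candidate)
    return CascadeResult("ENHANCED", review)
```

In this sketch the cost saving comes from the SKIPPED branch: when most turns are simple, most requests never reach the expensive model, which is also why median latency can drop.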

Top-level tags: llm systems
Detailed tags: speculative decoding, cost reduction, llm api serving, latency optimization, proxy-layer system

RLM-Cascade: Response-Level Speculative Decoding for Cost-Efficient LLM API Serving


1️⃣ One-sentence summary

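This paper presents RLM-Cascade, a proxy layer that sits on top of LLM APIs: a small model first drafts a response quickly, and a routing mechanism then determines whether the large model accepts the draft, refines it, or is bypassed entirely. Without modifying the underlying large model, the system cuts API cost by nearly 46% on a coding-assistant workload, speeds up median response time by a factor of about 1.8, and matches or exceeds the baseline's answer quality.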
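As a quick sanity check, the headline figures can be recomputed from the numbers reported in the abstract. The variable names are illustrative, and reading "draft-use rate" as the share of requests answered with the draft is an assumption.

```python
# Figures as reported in the abstract.
native_p50_ms = 3698      # median latency, Native Opus
cascade_p50_ms = 2026     # median latency, RLM-Cascade
total_requests = 125
draft_use_rate = 0.888

# Speedup at the median is the ratio of the two latencies.
speedup = native_p50_ms / cascade_p50_ms

# Number of requests implied by the draft-use rate (assumed interpretation).
draft_requests = round(total_requests * draft_use_rate)

print(f"p50 speedup: {speedup:.2f}x")
print(f"draft-served requests: {draft_requests} of {total_requests}")
```

The ratio rounds to the 1.83x quoted in the abstract, and 88.8% of 125 requests corresponds to a whole number of requests, so the reported figures are internally consistent.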

Source: arXiv: 2606.22840