RLM-Cascade: Response-Level Speculative Decoding for Cost-Efficient LLM API Serving

📄 Abstract

We present RLM-Cascade, a proxy-layer system that applies speculative decoding at the response level to reduce LLM API costs without requiring access to the model architecture or a shared vocabulary. A fast, inexpensive draft model generates a candidate response; depending on a lightweight complexity router, a more capable verify model then accepts the draft, enhances it, or is bypassed entirely. On a real-world agentic coding workload (Claude Code), RLM-Cascade achieves a draft-use rate of 88.8% across 125 production requests and reduces API cost by 45.8% relative to a direct Opus baseline. Counter-intuitively, the proxy also reduces end-to-end latency: median response time is 2,026 ms versus 3,698 ms for Native Opus, a 1.83× speedup at p50, because the SKIPPED path (DeepSeek only, no Opus call) dominates the workload distribution. Quality matches or exceeds the Opus baseline: RLM-Cascade reaches a 100% pass rate on a 20-task Code/Math/Instruct benchmark versus 95% for Native Opus. We further describe a rule-based complexity router that selects the SKIPPED path for simple agentic turns and a hybrid tool-call strategy that bypasses the speculative pipeline for schema-critical tool-selection turns. RLM-Cascade is deployed in production as an enterprise AI infrastructure component and is published as open source with a live metrics dashboard and a Prometheus endpoint.
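The routing flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: only the SKIPPED path name comes from the abstract, while the `ACCEPTED`, `ENHANCED`, and `DIRECT` labels, the `is_simple` rule, the `needs_tool_schema` flag, and the "reply ACCEPT" verification protocol are all hypothetical. The sketch also assumes that schema-critical tool-selection turns go straight to the verify model, which the abstract does not state explicitly. Models are represented as plain callables so the example runs without any API client.

```python
from dataclasses import dataclass
from typing import Callable

# A model is any callable mapping a prompt to a response string (stub interface).
Model = Callable[[str], str]


@dataclass
class CascadeResult:
    text: str
    path: str  # "SKIPPED" (from the abstract); "ACCEPTED", "ENHANCED", "DIRECT" are hypothetical labels


def is_simple(prompt: str, max_words: int = 40) -> bool:
    """Hypothetical rule-based complexity check: short prompts without 'hard' keywords."""
    hard_markers = ("prove", "refactor", "architecture", "design")
    lowered = prompt.lower()
    return len(prompt.split()) <= max_words and not any(m in lowered for m in hard_markers)


def cascade(prompt: str, draft: Model, verify: Model,
            needs_tool_schema: bool = False) -> CascadeResult:
    # Assumption: schema-critical tool-selection turns bypass the speculative
    # pipeline and are answered by the verify model alone.
    if needs_tool_schema:
        return CascadeResult(verify(prompt), "DIRECT")

    # The cheap draft model always produces the candidate response first.
    candidate = draft(prompt)

    # SKIPPED path: the router judges the turn simple, so the verify model is never called.
    if is_simple(prompt):
        return CascadeResult(candidate, "SKIPPED")

    # Otherwise the verify model either accepts the draft or returns an improved response.
    review_prompt = (
        "Reply with exactly ACCEPT if the draft answers the request correctly; "
        "otherwise reply with an improved answer.\n\n"
        f"Request:\n{prompt}\n\nDraft:\n{candidate}"
    )
    verdict = verify(review_prompt)
    if verdict.strip() == "ACCEPT":
        return CascadeResult(candidate, "ACCEPTED")
    return CascadeResult(verdict, "ENHANCED")


if __name__ == "__main__":
    # Stub models stand in for real API calls.
    result = cascade("List files in src/", lambda p: "draft answer", lambda p: "ACCEPT")
    print(result.path, result.text)
```

In this sketch the cost and latency savings come from the SKIPPED branch, where only the draft model is invoked; the more often the router takes that branch, the closer the proxy's cost and latency get to those of the draft model alone.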


1️⃣ One-Sentence Summary

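This paper presents RLM-Cascade, a "proxy-layer" system built on top of LLM APIs. A small model first drafts a response quickly, and a routing mechanism then decides whether to use the draft as is, hand it to the large model for refinement, or bypass the small model entirely. Without modifying the underlying large model, the system cuts API cost by nearly 46% in a coding-assistant setting while nearly doubling response speed (1.83× at the median), and response quality matches or exceeds the baseline.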
