RW-TTT: Batched Serving for Request-Owned Test-Time Training State

📄 Abstract

Test-time training (TTT) adapts an LLM during generation by reading and updating request-owned state, such as fast weights, low-rank deltas, or streaming learner state. This breaks batched LLM serving, which assumes shared, static weights: serial execution is correct but slow, while naive batching can corrupt request state. We formulate this problem as read-write TTT serving and present RW-TTT, which tags each decode step with its owner, version, and READ/WRITE effect, batches only compatible phases, and commits each update only to its owner. On one GPU with eight fast-weight InPlace-TTT streams, RW-TTT reaches 274.61 aggregate tok/s, which is 9.31x over sequential serving and 3.44x over per-stream replicas under the same memory budget. It preserves behavior on RULER, a long-context benchmark, and passes owner/version checks.
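The tagging and batching idea in the abstract can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `Step`, `OwnedState`, `form_batches`, and `run_batch` are hypothetical, the state is a single float standing in for fast weights, and "compatible" is assumed here to mean "same READ/WRITE effect, at most one step per owner per batch".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Effect(Enum):
    READ = "read"    # step only reads the owner's state
    WRITE = "write"  # step produces an update to the owner's state


@dataclass(frozen=True)
class Step:
    owner: str          # request that owns the state this step touches
    version: int        # state version the step expects to observe
    effect: Effect
    delta: float = 0.0  # toy stand-in for a fast-weight update


@dataclass
class OwnedState:
    version: int = 0
    value: float = 0.0


def form_batches(steps: List[Step]) -> List[List[Step]]:
    """Greedily group steps so each batch has one effect and one step per owner.

    Once a step for an owner is deferred, every later step of that owner is
    deferred too, so per-owner order is preserved.
    """
    batches: List[List[Step]] = []
    pending = list(steps)
    while pending:
        phase = pending[0].effect
        batch, rest = [], []
        seen, blocked = set(), set()
        for s in pending:
            if s.owner not in blocked and s.owner not in seen and s.effect is phase:
                batch.append(s)
                seen.add(s.owner)
            else:
                rest.append(s)
                blocked.add(s.owner)
        batches.append(batch)
        pending = rest
    return batches


def run_batch(batch: List[Step], states: Dict[str, OwnedState]) -> None:
    """Check owner/version tags, then commit each WRITE only to its owner."""
    for s in batch:
        state = states[s.owner]
        if s.version != state.version:
            raise RuntimeError(f"stale version for owner {s.owner!r}")
        if s.effect is Effect.WRITE:
            state.value += s.delta
            state.version += 1


if __name__ == "__main__":
    steps = [
        Step("A", 0, Effect.READ), Step("B", 0, Effect.READ),
        Step("A", 0, Effect.WRITE, 1.0), Step("B", 0, Effect.WRITE, 2.0),
        Step("A", 1, Effect.READ), Step("B", 1, Effect.READ),
    ]
    states = {"A": OwnedState(), "B": OwnedState()}
    for batch in form_batches(steps):
        run_batch(batch, states)
    print(states)
```

In this toy model, two interleaved streams collapse into three batched phases (READ, WRITE, READ) instead of six serial steps, and a step carrying an outdated version tag is rejected rather than silently applied to the wrong state.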

1️⃣ One-Sentence Summary

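This paper proposes RW-TTT, a batched serving method that lets a large language model maintain independent test-time training state (such as fast weights or low-rank deltas) for each user request during generation. By grouping compatible decode steps and committing state updates only when it is safe to do so, it achieves a nearly 10x speedup over conventional sequential execution (9.31x in the reported experiment) without changing model behavior.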
