Abstract - Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation
Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target distribution, speculative decoding (SD) uses a small draft model to propose several next tokens at once and a target LLM to verify and accept the longest prefix, skipping multiple steps per round. In generative recommendation, however, each item is represented by multiple semantic-ID tokens, often with separators, and current drafts typically treat these tokens uniformly. This overlooks two practical facts: (i) a token's semantics depend on its within-item slot, and (ii) uncertainty tends to increase with speculation depth. Without modeling these effects, SD's speedups can be limited. We introduce PAD-Rec, Position-Aware Drafting for generative Recommendation, a lightweight module that augments the draft model with two complementary signals. Item position embeddings explicitly encode the within-item slot of each token, strengthening structural awareness. Step position embeddings encode the draft step, allowing the model to adapt to depth-dependent uncertainty and improve proposal quality. To harmonize these signals with base features, we add simple gates: a learnable coefficient for item slots and a context-driven gate for draft steps. The module is trainable, easy to integrate with standard draft models, and adds negligible inference overhead. Extensive experiments on four real-world datasets show up to 3.1x wall-clock speedup and about 5% average wall-clock speedup gain over strong SD baselines, while largely preserving recommendation quality.
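The two gated signals described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the embedding tables (`item_slot_emb`, `step_emb`), the scalar gate `alpha`, and the context-gate weights `W_gate` are all hypothetical stand-ins for learned parameters, shown here with NumPy for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                # embedding dimension (hypothetical)
TOKENS_PER_ITEM = 4  # semantic-ID tokens per item, incl. separator (assumption)
MAX_STEPS = 5        # maximum speculation depth (assumption)

# Stand-ins for learned parameter tables
item_slot_emb = rng.normal(size=(TOKENS_PER_ITEM, D)) * 0.02
step_emb = rng.normal(size=(MAX_STEPS, D)) * 0.02
alpha = 0.5                             # learnable scalar gate for item slots
W_gate = rng.normal(size=(D,)) * 0.02   # context-driven gate weights for steps

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def augment(base_emb, token_index_in_item, draft_step, context_vec):
    """Add the two gated positional signals to a draft token embedding:
    - the within-item slot embedding, scaled by a learnable coefficient;
    - the draft-step embedding, scaled by a scalar gate computed from context."""
    slot = item_slot_emb[token_index_in_item % TOKENS_PER_ITEM]
    step = step_emb[draft_step]
    g = sigmoid(W_gate @ context_vec)   # context-driven gate in (0, 1)
    return base_emb + alpha * slot + g * step
```

Because the module only adds two embedding lookups and one dot product per token, the inference overhead is negligible, which matches the abstract's claim.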
Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation
1️⃣ One-Sentence Summary
To accelerate inference when an LLM generates recommendation lists, this paper proposes a lightweight "position-aware drafting" module. By distinguishing the tokens at different positions within each recommended item, as well as the depth of each draft step, the small draft model predicts candidate tokens more accurately. This raises parallel-verification efficiency, achieving up to a 3.1x inference speedup while preserving recommendation quality.
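The draft-then-verify loop that the summary refers to can be sketched as follows. This is a minimal greedy-verification sketch, assuming a deterministic target model; `target_next_token` is a hypothetical stand-in for the target LLM's next-token choice, not an API from the paper.

```python
def verify_longest_prefix(draft_tokens, target_next_token):
    """Greedy speculative-decoding verification: accept draft tokens
    while each one matches the token the target model would itself emit,
    and stop at the first mismatch.

    target_next_token(prefix) -> token id (stand-in for the target LLM).
    Returns (accepted_prefix, correction_token)."""
    accepted = []
    for t in draft_tokens:
        if target_next_token(accepted) == t:
            accepted.append(t)
        else:
            break
    # The target supplies one corrected token beyond the accepted prefix,
    # so every verification round makes progress even with zero accepts.
    correction = target_next_token(accepted)
    return accepted, correction
```

For example, if the target's sequence is `1, 2, 3, 4, ...` and the draft proposes `[1, 2, 9]`, the first two tokens are accepted and the target emits `3` as the correction; three decoding positions are resolved in a single target forward pass, which is the source of the wall-clock speedup.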