arXiv submission date: 2026-07-02
📄 Abstract - PARTREP: Learning What to Repeat for Decoder-only LLMs

While decoder-only LLMs excel at a vast array of natural language tasks, they suffer from an asymmetric information flow induced by causal attention: later tokens are richer in contextual grounding than earlier ones. A simple and effective remedy is prompt repetition -- just appending a second copy of the prompt before generation can redistribute grounding across positions and improve reasoning performance. However, full repetition of the original prompt doubles the KV cache footprint and quadruples attention cost during prefill, making it impractical for long-context settings. We propose PartRep, a selective augmentation method that appends only the most informative tokens -- rather than the entire prompt. We use token-wise negative log-likelihood (NLL) as a selection signal, motivated by the hypothesis that less predictable tokens are less recoverable from surrounding context and therefore benefit more from late-position repetition. To avoid the heavy cost of a full forward pass for scoring, we train a lightweight gate that predicts high-NLL tokens from early-layer hidden states, enabling token selection during mid-prefill via early exit. Across eight benchmarks (including MMLU, GSM8K, and RULER) and three model families (Qwen2.5, Llama3.2, Gemma4), PartRep retains most of the gains of full repetition while using only 59.4% of its KV cache and 79.0% of its prefill FLOPs.
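The selection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `keep_ratio` parameter, the toy probabilities, and the choice to keep selected tokens in prompt order are all assumptions made for illustration. The sketch scores tokens with their exact NLL, whereas the paper trains a lightweight gate to predict high-NLL tokens from early-layer hidden states. The cost helper uses a simplified model (KV cache linear in sequence length, attention cost quadratic) and ignores every other prefill cost.

```python
import math


def token_nll(probs):
    """Negative log-likelihood of each token, given the probability the model assigned to it."""
    return [-math.log(p) for p in probs]


def select_tokens_to_repeat(tokens, probs, keep_ratio):
    """Pick the least predictable (highest-NLL) tokens.

    Keeping them in their original prompt order is an assumption of this sketch.
    """
    nll = token_nll(probs)
    k = max(1, round(len(tokens) * keep_ratio))
    # Rank positions by NLL, highest first, and keep the top k.
    top = sorted(range(len(tokens)), key=lambda i: nll[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


def partial_repeat(tokens, probs, keep_ratio):
    """Append only the selected tokens after the original prompt."""
    return tokens + select_tokens_to_repeat(tokens, probs, keep_ratio)


def relative_cost(keep_ratio):
    """(KV cache, attention) cost relative to the unrepeated prompt.

    Simplified model (an assumption): KV cache grows linearly with
    sequence length and attention cost grows quadratically.
    """
    length_factor = 1 + keep_ratio
    return length_factor, length_factor ** 2


# Hypothetical per-token probabilities, made up for this example.
tokens = ["The", "capital", "of", "France", "is", "Paris"]
probs = [0.9, 0.05, 0.95, 0.02, 0.8, 0.1]

print(select_tokens_to_repeat(tokens, probs, keep_ratio=0.5))
print(partial_repeat(tokens, probs, keep_ratio=0.5))
print(relative_cost(1.0))  # full repetition
print(relative_cost(0.5))  # repeating half of the tokens
```

With `keep_ratio=1.0` the helper reproduces the abstract's statement that full repetition doubles the KV cache and quadruples attention cost. Smaller ratios shrink both factors, which is the trade-off PartRep targets.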

Top-level tags: llm, model training
Detailed tags: prompt repetition, kv cache optimization, token selection, negative log-likelihood, early exit

PARTREP: Learning What to Repeat for Decoder-only LLMs


1️⃣ One-sentence summary

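To address the asymmetry in information between earlier and later positions that causal attention induces in decoder-only large language models, this paper proposes PartRep, an efficient method that selectively repeats only the most informative tokens of the prompt rather than the full prompt, retaining most of the performance gains of full prompt repetition while substantially reducing compute and memory overhead.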

Source: arXiv: 2607.01792