arXiv submission date: 2026-03-10
📄 Abstract - Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training

We study efficient reasoning under tight compute budgets: how can a model make structured, correct decisions without increasing test-time cost? We add two training-only components to small and medium Transformers that also transfer to broader differentiable optimizers. First, a length-aware attention prior, built via fuzzy regime position alignment (RPA), yields a normalized pre-softmax bias that guides attention like a structured regularizer while adding no new inference parameters. Second, a minimal gain-aware controller, Guardian, nudges attention sharpness only when validation improvements warrant it, following a two-timescale policy-gradient view of nonconvex optimization; it is disabled at inference. A KL perspective shows that softmax(z + log π) is the MAP solution of a KL-regularized objective, grounding the prior in a principled formulation. Under strict compute parity on WikiText-2, we reduce validation cross-entropy while matching baseline latency and memory. At inference, we add a precomputed, cached prior B(T) as a single additive bias per head; the controller does not run. In practice, this incurs negligible overhead (one cached bias add per head) with no measurable p50 latency shift. Our results suggest that length-aware priors and late-phase gain control preserve scarce improvements, especially in long-span, noisy-logit regimes, while keeping test-time costs effectively unchanged.
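The inference-time mechanism the abstract describes, a precomputed additive bias applied before the softmax, can be sketched in a few lines. The sketch below is a minimal single-head illustration, not the paper's implementation: the distance-decay prior used here is a hypothetical stand-in for the RPA-built prior B(T), and all shapes and names are illustrative. It shows the key property that softmax(z + log π) reweights the base attention distribution by π at the cost of one cached bias add.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_length_prior(q, k, v, log_prior):
    """Single-head attention with a precomputed additive pre-softmax bias.

    q, k, v: (T, d) arrays; log_prior: (T, T) array playing the role of
    log pi. Computing softmax(z + log_prior) corresponds to the
    KL-regularized (MAP) reweighting mentioned in the abstract.
    """
    d = q.shape[-1]
    z = q @ k.T / np.sqrt(d)             # raw attention logits z
    w = softmax(z + log_prior, axis=-1)  # bias added before softmax
    return w @ v

# Hypothetical length-aware prior: favor nearby positions, decaying
# with distance. In the paper this would be the cached prior B(T),
# computed once and reused at every inference step.
T, d = 8, 16
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
log_prior = -0.1 * dist

out = attention_with_length_prior(q, k, v, log_prior)
print(out.shape)  # (8, 16)
```

Because the bias enters additively in the logits, the extra cost per head is a single (T, T) add, consistent with the abstract's claim of no new inference parameters and no measurable latency shift.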

Top-level tags: model training, natural language processing, theory
Detailed tags: attention mechanisms, efficient inference, transformers, regularization, optimization

Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training


1️⃣ One-sentence summary

This paper introduces a length-aware attention prior and a gain-aware controller during training, enabling small and medium Transformer models to handle long sequences and noisy data better without any additional inference-time compute, improving performance while keeping test-time cost unchanged.

Source: arXiv:2603.09253