arXiv submission date: 2026-03-09
📄 Abstract - Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers

The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard transform followed by a lightweight learnable affine rescaling, eliminating approximately 25 percent of attention parameters per block while preserving global cross-head interaction through an orthogonal, norm-preserving transformation. Across model sizes, we demonstrate that this structured substitution maintains comparable or slightly superior downstream task performance on standard benchmarks, while achieving up to a 7 percent aggregate parameter reduction, 8.9 percent peak memory savings, and a 6.6 percent throughput improvement at scale, with efficiency gains growing monotonically with model size, batch size, and sequence length. Interestingly, we observe that structured Hadamard-based models exhibit a steeper validation-loss curve relative to training FLOPs than their dense counterparts, suggesting more favorable compute utilization during training.

Top-level tags: model training, machine learning theory
Detailed tags: transformer efficiency, attention mechanism, parameter reduction, Hadamard transform, model compression

Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers


1️⃣ One-sentence summary

This paper proposes replacing the compute- and parameter-heavy attention output projection in Transformers with a fixed, parameter-free Hadamard transform plus a lightweight learnable rescaling, significantly reducing parameters and memory consumption and improving inference speed while preserving model performance.
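The idea in the summary can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's implementation: the function names (`fwht`, `hadamard_output_proj`) and the per-dimension `gamma`/`beta` parameterization of the "learnable affine rescaling" are assumptions; the paper may define the rescaling differently.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.
    The last dimension must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    d = x.shape[-1]
    assert d & (d - 1) == 0, "dimension must be a power of two"
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a = x[..., j].copy()
                b = x[..., j + h].copy()
                x[..., j] = a + b      # sum branch of the butterfly
                x[..., j + h] = a - b  # difference branch
        h *= 2
    return x / np.sqrt(d)  # 1/sqrt(d) scaling makes the transform orthonormal

def hadamard_output_proj(heads_concat, gamma, beta):
    """Hypothetical drop-in replacement for the dense attention output
    projection: a parameter-free WHT mixes information across all head
    channels, followed by a learnable per-dimension affine rescale -- the
    only trained parameters, O(d) instead of O(d^2)."""
    return gamma * fwht(heads_concat) + beta

# Toy usage: (batch, seq, d_model) activations from the concatenated heads.
rng = np.random.default_rng(0)
d_model = 8
x = rng.standard_normal((2, 4, d_model))
gamma = np.ones(d_model)   # learnable scale, identity at init
beta = np.zeros(d_model)   # learnable shift, zero at init
y = hadamard_output_proj(x, gamma, beta)

# The WHT is orthogonal, so it preserves norms (the abstract's
# "norm-preserving transformation"), and since H is symmetric and
# orthonormal, applying it twice is the identity.
assert np.allclose(np.linalg.norm(fwht(x), axis=-1),
                   np.linalg.norm(x, axis=-1))
assert np.allclose(fwht(fwht(x)), x)
```

Because the transform is fixed, the only parameters per output projection are the 2·d affine values, versus d² weights in a dense projection, which is where the quoted ~25 percent per-block attention parameter saving comes from.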

Source: arXiv:2603.08343