DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training
1️⃣ One-sentence summary
This paper proposes a new scheduling method called DASH, which substantially improves the efficiency of deterministic large language model training by optimizing the execution order of compute and gradient-accumulation operations, significantly reducing the performance loss while keeping results reproducible.
Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap between deterministic and non-deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28× compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at this https URL.
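To make the scheduling intuition concrete, below is a minimal toy cost model, not the authors' kernel or API, that contrasts ascending versus descending Q-tile traversal under a causal mask. It assumes one worker per K/V tile, unit cost per Q tile, and a fixed per-tile commit order for dQ accumulation (the source of determinism); the function name `makespan` and all parameters are illustrative assumptions rather than anything from the paper's codebase.

```python
# Toy cost model (not the paper's implementation): each "worker" owns one K/V
# tile, walks over the Q tiles visible to it under a causal mask, and must
# commit its dQ contribution for a Q tile strictly after the previous worker's
# commit to the same tile, so the accumulation order is fixed and reproducible.

def makespan(num_tiles: int, descending: bool) -> int:
    """Return the finish time (in unit tile-steps) of the last commit."""
    finish = {}  # (q_tile, kv_worker) -> finish time
    total = 0
    for worker in range(num_tiles):            # one worker per K/V tile
        q_tiles = range(worker, num_tiles)     # causal mask: q_tile >= kv_tile
        order = reversed(q_tiles) if descending else q_tiles
        t = 0                                  # worker-local clock
        for q in order:
            # Deterministic accumulation: wait for the previous worker's
            # commit to the same Q tile before committing ours.
            dep = finish.get((q, worker - 1), 0)
            t = max(t, dep) + 1                # unit cost per Q tile
            finish[(q, worker)] = t
            total = max(total, t)
    return total

if __name__ == "__main__":
    T = 16
    print("ascending :", makespan(T, descending=False))  # ~2T - 1 steps
    print("descending:", makespan(T, descending=True))   # ~T steps
```

Under this simplified model, ascending traversal forces each worker to wait behind its predecessor and the critical path grows to roughly 2T steps, whereas descending traversal lets the commits pipeline in roughly T steps. This is the qualitative effect that the paper's Descending Q-Tile Iteration exploits; the real kernel's costs and overlap behavior will of course differ.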
Source: arXiv: 2601.21824