DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training
1️⃣ One-Sentence Summary
This paper proposes DBLP, a new network transport protocol that dynamically adjusts the loss tolerance of gradient data according to the current phase of model training. This effectively mitigates the severe training-latency fluctuations caused by transient network congestion, reducing end-to-end training time by 24.4% on average and delivering nearly 6x communication speedups during burst events.
Distributed machine learning (ML) training has become a necessity with the prevalence of billion- to trillion-parameter models. While prior work has improved training efficiency from the ML perspective at the application layer, it often fails to address transient congestion events at the network layer, which introduce severe tail latency and training-time variability and thereby undermine the quality of service (QoS) of distributed ML training systems. Existing network optimizations treat all gradients equally and thus fail to integrate sufficient model-training insight into communication protocol design. In this paper, we present the Dynamic Bounded-Loss Protocol (DBLP), a burst-resilient, training-phase-aware, and hardware-agnostic transport protocol that incorporates model-level tolerance properties into gradient communication. By dynamically adjusting gradient loss tolerance across training phases, DBLP reduces overall training time and mitigates tail-latency collapse during transient high-loss events (i.e., microbursts). Compared to the current state-of-the-art solution (the baseline), DBLP tolerates significantly higher loss while achieving comparable test accuracy, and it reduces end-to-end training time by 24.4% on average and by up to 33.9%. During microburst events, DBLP achieves up to a 5.88x single-round communication-latency speedup over the baseline, preventing burst-induced tail-latency spikes and maintaining stable training performance.
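The abstract does not spell out DBLP's actual schedule, but a minimal sketch of the core idea it describes, mapping training progress to a bounded per-round gradient-loss budget that gates retransmission, might look like the following. The function names, phase boundaries, and tolerance values here are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a phase-aware gradient-loss-tolerance schedule.
# All names, phase boundaries, and tolerance values are illustrative
# assumptions; the paper's actual schedule is not given in this summary.

def loss_tolerance(step: int, total_steps: int) -> float:
    """Return the fraction of gradient packets that may be dropped this round.

    Intuition (hedged): early training is noisy and tolerates more gradient
    loss, while late training needs near-lossless delivery to converge.
    """
    progress = step / total_steps
    if progress < 0.3:    # early phase: coarse updates, high tolerance
        return 0.20
    elif progress < 0.8:  # middle phase: moderate tolerance
        return 0.05
    else:                 # late phase: near-lossless to protect accuracy
        return 0.01


def should_retransmit(dropped_fraction: float, step: int, total_steps: int) -> bool:
    """Sender-side check: retransmit only once the round's loss budget is exceeded.

    During a microburst, rounds whose drops stay within the budget complete
    without retransmission, avoiding burst-induced tail-latency spikes.
    """
    return dropped_fraction > loss_tolerance(step, total_steps)
```

The design point this sketch illustrates is that skipping retransmission within a phase-dependent budget trades a bounded amount of gradient fidelity for stable round completion times, which is where the paper's reported tail-latency gains during microbursts would come from.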
Source: arXiv: 2605.01989