FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training
1️⃣ One-Sentence Summary
This paper proposes FedQueue, a federated learning protocol that predicts and exploits job-scheduler queue delays at HPC facilities to adaptively budget local training, bound update staleness, and aggregate heterogeneous models, significantly improving training efficiency and model accuracy in cross-facility distributed training.
Federated learning (FL) across multiple HPC facilities faces stochastic admission delays from batch schedulers that dominate wall-clock time. Synchronous FL suffers from severe stragglers, while asynchronous FL accumulates stale updates when queues spike. We propose FedQueue, a queue-aware FL protocol that incorporates scheduler delays directly into training and aggregation: it (i) predicts per-facility queue delays online to budget local work, (ii) applies cutoff-based admission that buffers late arrivals to bound staleness, and (iii) performs staleness-aware aggregation to stabilize heterogeneous local workloads. We prove convergence for non-convex objectives at rate $\mathcal{O}(1/\sqrt{R})$ under bounded staleness, and show that the admission controls yield bounded staleness with high probability under queue-prediction error. A real-world cross-facility deployment of FedQueue shows a 20.5% improvement over baseline algorithms. Controlled queue simulations demonstrate robust improvement over the baselines; in particular, about a 34% reduction in time-to-target-accuracy under high queue variance and non-IID partitions.
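To make component (iii) concrete, the sketch below illustrates one common form of staleness-aware aggregation from the asynchronous-FL literature: buffered client updates are down-weighted by a polynomial decay in their staleness before averaging. The decay rule `(1 + s)^(-alpha)` and the parameter `alpha` are illustrative assumptions, not necessarily FedQueue's exact weighting.

```python
import numpy as np

def staleness_aware_aggregate(global_model, updates, current_round, alpha=0.5):
    """Aggregate buffered client updates, down-weighting stale ones.

    `updates` is a list of (delta, round_produced) pairs; staleness is the
    number of rounds elapsed since each update was computed.  The
    polynomial decay (1 + s)^(-alpha) is a standard choice in
    asynchronous FL (an assumption here, not the paper's exact rule).
    """
    weights, deltas = [], []
    for delta, round_produced in updates:
        staleness = current_round - round_produced
        weights.append((1.0 + staleness) ** (-alpha))
        deltas.append(delta)
    weights = np.asarray(weights)
    weights /= weights.sum()  # normalize so the weights form a convex combination
    # apply the staleness-weighted average of deltas to the global model
    return global_model + sum(w * d for w, d in zip(weights, deltas))
```

With equal staleness this reduces to plain averaging; a spike in one facility's queue delay shrinks that facility's influence smoothly rather than discarding its work.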
Source: arXiv:2605.02125