arXiv submission date: 2026-02-25
📄 Abstract - DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism

Scaling long-context capabilities is crucial for Multimodal Large Language Models (MLLMs). However, real-world multimodal datasets are extremely heterogeneous. Existing training frameworks predominantly rely on static parallelism strategies, which suffer from severe load imbalance, redundant communication, and suboptimal hardware utilization under data heterogeneity. In this work, we propose Dynamic Hybrid Parallelism (DHP), an efficient parallelism strategy that adaptively reconfigures communication groups and parallelism degrees during MLLM training. We generalize parallelism degrees beyond powers of two and develop a polynomial-time algorithm that generates near-optimal parallelism strategies with only millisecond-level overhead per training batch. DHP maintains high hardware efficiency even under extreme data variability. Experimental results demonstrate that DHP significantly outperforms Megatron-LM and DeepSpeed, achieving up to 1.36× speedup in training throughput while maintaining near-linear scaling efficiency across large-scale NPU clusters.
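To make the per-batch decision concrete, here is a minimal toy sketch of the kind of search the abstract describes: enumerating candidate parallelism degrees (including non-powers-of-two) for each batch and picking the one with the lowest estimated cost. The cost model, the fixed-latency communication term, and all function names below are illustrative assumptions for this page, not the paper's actual algorithm or cost model.

```python
# Toy per-batch parallelism-degree search in the spirit of DHP.
# ASSUMPTIONS: the linear compute term and the latency-only communication
# term are stand-ins; DHP's real strategy search is more sophisticated.

def divisors(n):
    """All degrees that evenly partition n devices, e.g. 1,2,3,4,6,12 for n=12.
    Note this naturally includes non-power-of-two degrees such as 3 and 6."""
    return [d for d in range(1, n + 1) if n % d == 0]

def estimate_cost(seq_lens, degree, latency=50.0):
    """Crude cost model: compute shrinks as tokens are sharded across
    `degree` devices; each extra device adds a fixed communication latency."""
    compute = sum(seq_lens) / degree          # even-split assumption
    comm = latency * (degree - 1)             # latency-only comm term
    return compute + comm

def pick_degree(seq_lens, num_devices):
    """Enumerate all candidate degrees (polynomial in num_devices) and
    return the cheapest one for this batch's sequence lengths."""
    return min(divisors(num_devices), key=lambda d: estimate_cost(seq_lens, d))
```

Even this toy version shows why a dynamic, non-power-of-two search helps: a very long sequence favors the full degree, a short one favors no sharding, and a medium-length sequence can land on an intermediate degree like 3 that a powers-of-two-only static strategy could never choose.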

Top-level tags: model training systems multi-modal
Detailed tags: parallel training dynamic parallelism scaling efficiency multimodal llm distributed systems

DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism


1️⃣ One-sentence summary

This paper proposes a new training method called Dynamic Hybrid Parallelism (DHP), which automatically adapts how compute resources are partitioned to the large variability of multimodal data, significantly speeding up the training of multimodal large language models while preserving efficient scaling.

Source: arXiv 2602.21788