AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
1️⃣ One-Sentence Summary
This paper introduces AutoSP, an automated tool that uses a compiler to optimize long-context training of large language models. Without requiring users to hand-write complex parallelization code, it increases the trainable context length by up to 2.7x on NVIDIA hardware and 2.5x on AMD hardware, while leaving runtime speed almost unaffected.
Large language models (LLMs) demonstrate enormous utility in long-context tasks, which require processing prompts consisting of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy-to-use abstractions for optimizing long-context training, focusing instead on optimizations for models with large parameter counts, such as ZeRO-3/FSDP and tensor and pipeline parallelism. This forces users to rewrite LLM training libraries to incorporate compositions of complex long-context optimizations, such as sequence parallelism, into their training pipelines; a process that requires in-depth expertise and reduces developer productivity. To tackle these challenges, we introduce AutoSP: the first automated solution for optimizing LLM training for longer contexts. AutoSP compiles models and applies a targeted set of optimizations, automated sequence parallelism and long-context-aware activation checkpointing, to drastically enhance LLM trainability at negligible cost to throughput. Our evaluation demonstrates AutoSP's capability on both NVIDIA and AMD hardware, increasing trainable context lengths by up to 2.7$\times$ and 2.5$\times$ respectively over competitive hand-written baselines, at negligible cost to runtime performance.
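To make the core idea concrete, below is a minimal single-process sketch of the general sequence-parallelism pattern that tools like AutoSP automate: each worker holds a shard of the sequence dimension, and an all-to-all exchange re-shards the activations from sequence-split to head-split so every worker can attend over the full sequence. This is an illustration of the technique, not AutoSP's actual implementation or API; the names (`P`, `all_to_all`, etc.) and the shapes are assumptions, and the "workers" are simulated by list indices rather than real distributed ranks.

```python
# Hedged sketch: simulated Ulysses-style sequence parallelism.
# NOT AutoSP's code; a real run would use torch.distributed all-to-all
# collectives instead of materializing full tensors on one process.
import torch

P = 4                  # assumed degree of sequence parallelism
S, H, D = 16, 8, 32    # sequence length, attention heads, head dim (assumed)
assert S % P == 0 and H % P == 0

def all_to_all(shards):
    """Simulate all-to-all: trade sequence shards for head shards.

    shards[p] has shape (S/P, H, D); output[p] has shape (S, H/P, D).
    """
    full = torch.cat(shards, dim=0)          # gather sequence: (S, H, D)
    return list(full.split(H // P, dim=1))   # scatter heads across workers

q, k, v = (torch.randn(S, H, D) for _ in range(3))

# Each worker starts with a contiguous shard of the sequence, all heads.
q_sh, k_sh, v_sh = (list(t.split(S // P, dim=0)) for t in (q, k, v))

# Re-shard so each worker owns H/P heads over the *full* sequence.
q_h, k_h, v_h = map(all_to_all, (q_sh, k_sh, v_sh))

outs = []
for p in range(P):  # each iteration stands in for one worker
    qp, kp, vp = q_h[p], k_h[p], v_h[p]      # each: (S, H/P, D)
    scores = qp.transpose(0, 1) @ kp.transpose(0, 1).transpose(-2, -1)
    attn = torch.softmax(scores / D ** 0.5, dim=-1)          # (H/P, S, S)
    outs.append((attn @ vp.transpose(0, 1)).transpose(0, 1)) # (S, H/P, D)

# Reference: unpartitioned attention must give the same result.
ref_scores = q.transpose(0, 1) @ k.transpose(0, 1).transpose(-2, -1)
ref = torch.softmax(ref_scores / D ** 0.5, dim=-1) @ v.transpose(0, 1)
assert torch.allclose(
    torch.cat(outs, dim=1).transpose(0, 1), ref, atol=1e-5
)
```

The equivalence check at the end is the point of the sketch: re-sharding along heads preserves exact attention outputs, so the memory saving from splitting the sequence across workers comes at no cost to the computed result. Automating exactly this kind of re-sharding and communication insertion (plus pairing it with activation checkpointing) is the job the paper assigns to the compiler.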
Source: arXiv: 2604.27089