Scaling Laws for Code: Every Programming Language Matters
1️⃣ One-sentence summary
This paper presents the first systematic study of scaling laws for multilingual code LLMs, finds that different programming languages differ substantially in how they affect model performance, and proposes a new method that significantly improves overall model performance by optimizing the mixture of languages in the training data.
Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Moreover, existing work focuses on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. It is therefore necessary to first investigate the scaling laws of individual PLs, and then consider their mutual influences to arrive at a final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting more than 1,000 experiments (equivalent to 336,000+ H800 GPU hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the parallel-pairing pre-training strategy (concatenating code snippets with their translations) significantly enhances cross-lingual abilities with favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing allocation to fast-saturating languages (e.g., Rust), achieving superior average performance across all PLs compared to a uniform distribution under the same compute budget.
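To make the idea of proportion-dependent token allocation concrete, here is a minimal, purely illustrative sketch in Python. It is not the paper's method: it assumes a simple per-language, data-only loss curve L_l(D_l) = E_l + B_l / D_l^beta_l (the actual functional form, fitted constants, and language set are not given in this summary, so every number and language entry below is a made-up placeholder) and searches for the mixture proportions that minimize the average predicted loss under a fixed token budget, starting from the uniform split used as the baseline.

```python
# Hypothetical sketch: allocate a fixed pre-training token budget across
# programming languages under assumed per-language loss curves.
# All constants and curve shapes are illustrative, not values from the paper.
import numpy as np
from scipy.optimize import minimize

# Assumed per-language loss: L(D_l) = E_l + B_l / D_l**beta_l, D_l in billions
# of tokens. A "fast-saturating" language gets a larger beta, so extra tokens
# help it less.
LANGS = {
    #              E_l    B_l   beta_l
    "python":     (1.20, 2.0, 0.30),
    "javascript": (1.25, 1.8, 0.28),
    "typescript": (1.25, 1.8, 0.28),
    "rust":       (1.30, 1.0, 0.45),  # saturates quickly (illustrative)
}
TOTAL_TOKENS_B = 1000.0  # total budget, in billions of tokens

def avg_loss(props):
    """Average predicted loss when language l receives fraction props[l] of the budget."""
    losses = []
    for (E, B, beta), p in zip(LANGS.values(), props):
        d = max(p, 1e-6) * TOTAL_TOKENS_B
        losses.append(E + B / d**beta)
    return float(np.mean(losses))

n = len(LANGS)
res = minimize(
    avg_loss,
    x0=np.full(n, 1.0 / n),                      # start from a uniform mixture
    bounds=[(0.01, 0.97)] * n,                   # keep every language represented
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
    method="SLSQP",
)

for lang, p in zip(LANGS, res.x):
    print(f"{lang:<11s} {p:.2%} of the budget")
print(f"uniform avg loss:   {avg_loss(np.full(n, 1.0 / n)):.4f}")
print(f"optimized avg loss: {avg_loss(res.x):.4f}")
```

Under these toy curves the optimizer shifts tokens away from the fast-saturating language toward the ones that keep improving, which mirrors the allocation intuition described in the abstract without claiming its exact formulation.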
Source: arXiv: 2512.13472