Less is More: Convergence Benefits of Fewer Data Weight Updates over Longer Horizon
1️⃣ One-Sentence Summary
Through theoretical analysis and experiments, this paper shows that in the "data mixing" task of reweighting different training data sources, it is better to devote more compute to model parameter updates than to frequent weight updates: updating the domain weights less often, with more thorough model training between updates, yields better convergence.
Data mixing--the strategic reweighting of training domains--is a critical component in training robust machine learning models. This problem is naturally formulated as a bilevel optimization task, where the outer loop optimizes domain weights to minimize validation loss, and the inner loop optimizes model parameters to minimize the weighted training loss. Classical bilevel optimization relies on hypergradients, which theoretically require the inner optimization to reach convergence. However, due to computational constraints, state-of-the-art methods use a finite, often small, number of inner update steps before updating the weights. The theoretical implications of this approximation are not well understood. In this work, we rigorously analyze the convergence behavior of data mixing with a finite number of inner steps $T$. We prove that the "greedy" practical approach of using $T=1$ can fail even in a simple quadratic example. Under a fixed parameter update budget $N$ and assuming the per-domain losses are strongly convex, we show that the optimal $T$ scales as $\Theta(\log N)$ (resp., $\Theta({(N \log N)}^{1/2})$) for the data mixing problem with access to full (resp., stochastic) gradients. We complement our theoretical results with proof-of-concept experiments.
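The bilevel structure described above can be illustrated with a minimal sketch. The quadratic per-domain losses, the validation target, the step sizes, and the exponentiated-gradient outer update below are all illustrative assumptions, not the paper's actual algorithm; the sketch only shows the budgeted loop where `N` parameter updates are spent and the domain weights are refreshed once every `T` inner steps.

```python
import numpy as np

# Hypothetical strongly convex quadratic example: two domains with
# L_i(theta) = 0.5 * a_i * (theta - c_i)^2.
a = np.array([1.0, 4.0])  # per-domain curvatures (assumed)
c = np.array([0.0, 1.0])  # per-domain minimizers (assumed)

def train_grad(theta, w):
    # Gradient of the weighted training loss sum_i w_i * L_i(theta).
    return float(np.sum(w * a * (theta - c)))

def val_loss(theta):
    # Assumed validation target: 0.5 * (theta - 0.5)^2.
    return 0.5 * (theta - 0.5) ** 2

def run(N, T, eta=0.1, beta=0.5):
    """Spend a fixed budget of N parameter updates, refreshing the
    domain weights only once every T inner steps."""
    theta, w = 0.0, np.array([0.5, 0.5])
    for step in range(N):
        # Inner loop: one gradient step on the weighted training loss.
        theta -= eta * train_grad(theta, w)
        if (step + 1) % T == 0:
            # Outer update (hypothetical proxy for the hypergradient):
            # score each domain by the validation loss after one step
            # on that domain alone, then do an exponentiated update on
            # the simplex so lower-scoring domains gain weight.
            g = np.array([val_loss(theta - eta * a_i * (theta - c_i))
                          for a_i, c_i in zip(a, c)])
            w = w * np.exp(-beta * g)
            w /= w.sum()
    return val_loss(theta)

# Compare a greedy schedule (T=1) with a longer inner horizon
# under the same total parameter-update budget N.
greedy = run(N=200, T=1)
longer = run(N=200, T=8)
```

Both calls use the same budget of 200 parameter updates; the first refreshes the weights after every step (the greedy `T=1` regime the paper shows can fail), the second only every 8 steps.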
Source: arXiv 2602.19510