Multi-task Code LLMs: Data Mix or Model Merge?
1️⃣ One-Sentence Summary
Through comparative experiments, this paper finds that under resource constraints, larger models (e.g., 7B parameters) are best turned into multi-task code models by merging multiple specialized models, which preserves strong performance on each task, whereas for smaller models (e.g., 2B parameters) mixing the task data directly during training is the better strategy.
Recent research advocates deploying smaller, specialized code LLMs in agentic frameworks alongside frontier models, sparking interest in efficient strategies for multi-task learning that balance performance, constraints, and costs. We compare two approaches for creating small, multi-task code LLMs: data mixing versus model merging. We conduct extensive experiments across two model families (Qwen Coder and DeepSeek Coder) at two scales (2B and 7B parameters), fine-tuning them for code generation and code summarization tasks. Our evaluation on the HumanEval, MBPP, and CodeXGlue benchmarks reveals that model merging achieves the best overall performance at the larger scale across model families, retaining 96% of specialized-model performance on code generation tasks while maintaining summarization capabilities. Notably, merged models can even surpass individually fine-tuned models, with our best Qwen Coder 2.5 7B configuration achieving 92.7% Pass@1 on HumanEval compared to 90.9% for its task-specific fine-tuned equivalent. At the smaller scale, we instead find data mixing to be the preferred strategy. We further introduce a weight analysis technique to understand how different tasks affect model parameters and what this implies for merging strategies. The results suggest that careful merging and mixing strategies can effectively combine task-specific capabilities without significant performance degradation, making them ideal for resource-constrained deployment scenarios.
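To make the two ideas concrete, below is a minimal sketch of what model merging and a simple weight analysis can look like. The abstract does not specify the paper's actual merging algorithm or analysis procedure, so the uniform (50/50) parameter interpolation, the checkpoint file names, and the per-tensor delta norms here are illustrative assumptions, not the authors' method.

```python
# Sketch: linear (uniform) weight merging of two task-specific checkpoints,
# plus a per-parameter delta-norm analysis relative to the base model.
# Checkpoint paths and the 0.5 interpolation weight are hypothetical.
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linearly interpolate two state dicts with identical keys and shapes."""
    return {name: alpha * sd_a[name] + (1.0 - alpha) * sd_b[name] for name in sd_a}

def task_delta_norms(base_sd, task_sd):
    """L2 norm of the fine-tuning update for each parameter tensor."""
    return {
        name: (task_sd[name].float() - base_sd[name].float()).norm().item()
        for name in base_sd
    }

if __name__ == "__main__":
    # Hypothetical checkpoints: the base coder model and two specialized fine-tunes.
    base = torch.load("base_coder.pt", map_location="cpu")
    gen = torch.load("codegen_finetuned.pt", map_location="cpu")
    summ = torch.load("summarization_finetuned.pt", map_location="cpu")

    # Merge the two specialists into one multi-task model.
    merged = merge_state_dicts(gen, summ, alpha=0.5)
    torch.save(merged, "merged_multitask.pt")

    # Which tensors move most under code-generation fine-tuning? If the two tasks
    # update largely disjoint parameters, merging should interfere little.
    top = sorted(task_delta_norms(base, gen).items(), key=lambda kv: kv[1], reverse=True)[:5]
    for name, norm in top:
        print(f"{name}: ||delta_codegen|| = {norm:.3f}")
```

The alternative strategy from the paper, data mixing, needs no merging step: the code-generation and code-summarization training sets are simply combined into one fine-tuning corpus for a single model.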
Source: arXiv: 2601.21115