FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents
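1️⃣ One-Sentence Summary
This paper introduces FT-Dojo, an interactive environment, and FT-Agent, an autonomous system, in a first attempt to have LLM-based agents automate the entire model fine-tuning process the way human experts do, from data collection and processing through training and iterative refinement. It validates the approach on multiple tasks and also exposes the limitations of current methods in causal reasoning.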
Fine-tuning large language models for vertical domains remains a labor-intensive and expensive process, requiring domain experts to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning, no prior work has tackled end-to-end LLM fine-tuning with agents. Can LLM-based agents automate this complete process? We frame this as a substantially open problem: agents must navigate an open-ended search space spanning data curation from diverse data sources, processing with complex tools, building a training pipeline, and iteratively refining their approach based on evaluation outcomes in rapidly growing logs, an overall scenario far more intricate than existing benchmarks. To study this question, we introduce FT-Dojo, an interactive environment comprising 13 tasks across 5 domains. We further develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies. Experiments on FT-Dojo demonstrate that purpose-built fine-tuning agents significantly outperform general-purpose alternatives, with FT-Agent achieving the best performance on 10 out of 13 tasks across all five domains. Ablations show that the approach generalizes effectively to 3B models, with additional insights on data scaling trade-offs and backbone sensitivity. Case analyses reveal that agents can recover from failures through cumulative learning from historical experience, while also exposing fundamental limitations in causal reasoning, highlighting both the promise and current boundaries of autonomous LLM fine-tuning.
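The abstract describes an evaluation-driven loop: run a fine-tuning attempt, diagnose the outcome against earlier attempts, and refine the strategy. The sketch below illustrates that general pattern only. It is not FT-Agent's implementation, and every name in it (`Attempt`, `refine_loop`, the `data_fraction` and `step` knobs, and the toy scoring function) is a hypothetical stand-in; a real system would replace the toy callables with actual training, evaluation, and LLM-based diagnosis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


@dataclass
class Attempt:
    """One fine-tuning attempt kept in the agent's history (hypothetical record)."""
    config: Config
    score: float
    diagnosis: str


def refine_loop(
    initial_config: Config,
    train_and_eval: Callable[[Config], float],
    diagnose: Callable[[Config, float, List[Attempt]], str],
    propose: Callable[[Config, str, List[Attempt]], Config],
    max_rounds: int = 5,
) -> Tuple[Attempt, List[Attempt]]:
    """Generic evaluate -> diagnose -> refine loop over a growing history."""
    history: List[Attempt] = []
    config = dict(initial_config)
    for _ in range(max_rounds):
        score = train_and_eval(config)
        diagnosis = diagnose(config, score, history)
        history.append(Attempt(dict(config), score, diagnosis))
        config = propose(config, diagnosis, history)
    best = max(history, key=lambda a: a.score)
    return best, history


def toy_train_and_eval(config: Config) -> float:
    # Assumption: synthetic stand-in for a real fine-tune plus benchmark run.
    # The score peaks when the hypothetical data_fraction knob equals 0.6.
    return 1.0 - (config["data_fraction"] - 0.6) ** 2


def toy_diagnose(config: Config, score: float, history: List[Attempt]) -> str:
    # Assumption: a real agent would read logs; here we only compare scores.
    if not history:
        return "baseline"
    return "improved" if score > history[-1].score else "regressed"


def toy_propose(config: Config, diagnosis: str, history: List[Attempt]) -> Config:
    new = dict(config)
    if diagnosis == "regressed":
        # Fall back to the best configuration seen so far, then search
        # in the opposite direction with a smaller step.
        best = max(history, key=lambda a: a.score)
        new = dict(best.config)
        new["step"] = -config["step"] / 2
    new["data_fraction"] = new["data_fraction"] + new["step"]
    return new


if __name__ == "__main__":
    best, history = refine_loop(
        {"data_fraction": 0.2, "step": 0.2},
        toy_train_and_eval,
        toy_diagnose,
        toy_propose,
    )
    for attempt in history:
        print(attempt)
    print("best:", best)
```

Keeping the full history, rather than only the latest attempt, is what lets the toy proposer return to the best earlier configuration after a regression, a simplified analogue of the recovery from failures that the case analyses mention.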
Source: arXiv: 2603.01712