📄 Abstract - SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation

Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high-quality instruction datasets for SFT. Our approach begins with a limited set of domain-specific, human-generated questions, which are systematically expanded using a large language model. Subsequently, domain-relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction-response pairs, and the source code in a publicly accessible Git repository: [this https URL](this https URL)
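The pipeline the abstract describes (seed questions → LLM-based expansion → retrieval of domain resources → grounded answer generation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three calls (`expand_questions`, `retrieve_context`, `generate_answer`) are hypothetical stubs standing in for an LLM and a domain retriever.

```python
# Hedged sketch of the SearchInstruct flow: expand a small seed set of
# questions, retrieve domain passages per question, then generate an
# answer grounded in the retrieved context. All three components below
# are deterministic placeholders for illustration only.

def expand_questions(seed_questions, n_variants=2):
    """Placeholder for LLM-based question augmentation: keep each seed
    and append simple paraphrase variants."""
    expanded = []
    for q in seed_questions:
        expanded.append(q)
        for i in range(1, n_variants + 1):
            expanded.append(f"{q} (variant {i})")
    return expanded

def retrieve_context(question, corpus):
    """Placeholder retriever: return passages sharing at least one
    word with the question (a real system would use dense or BM25
    retrieval over domain resources)."""
    words = set(question.lower().split())
    return [p for p in corpus if words & set(p.lower().split())]

def generate_answer(question, passages):
    """Placeholder for LLM answer generation conditioned on the
    retrieved passages."""
    context = " ".join(passages) if passages else "no supporting context found"
    return f"Answer to '{question}' based on: {context}"

def search_instruct(seed_questions, corpus):
    """Build instruction-response pairs: expand, retrieve, answer."""
    pairs = []
    for q in expand_questions(seed_questions):
        passages = retrieve_context(q, corpus)
        pairs.append({"instruction": q,
                      "response": generate_answer(q, passages)})
    return pairs
```

The resulting list of `{"instruction", "response"}` dictionaries mirrors the kind of SFT dataset the method produces; swapping the stubs for real LLM and retriever calls is where the method's quality gains would come from.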

Top-level tags: llm model training data
Detailed tags: domain adaptation instruction tuning dataset generation retrieval-augmented generation supervised fine-tuning

📄 Paper Summary

SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation


1️⃣ One-Sentence Summary

This paper proposes an innovative method called SearchInstruct, which starts from a small set of human-written questions, automatically expands them with a large language model, and retrieves relevant domain resources to generate a high-quality instruction dataset, thereby effectively improving the domain adaptability and performance of large language models in specialized fields.

