arXiv submission date: 2026-04-02
📄 Abstract - DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed that replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance. DEFT computes a differential distribution reward from the output distribution of the language model and the discrepancy distribution of the preference data. Using this reward, a small yet high-quality subset is filtered from the raw data and incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.
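The abstract describes DEFT's filtering step only at a high level: score each preference pair with a differential distribution reward derived from the model's output distribution, then keep a small high-quality subset. The paper's exact reward formula is not given here, so the sketch below substitutes a hypothetical scoring rule (the gap between the model's log-probabilities for the chosen and rejected responses) purely to illustrate the filter-then-fine-tune pipeline; `differential_distribution_reward`, `filter_subset`, and `keep_ratio` are all illustrative names, not the paper's API.

```python
# Minimal sketch of DEFT-style data filtering.
# The scoring rule is a hypothetical stand-in for the paper's
# differential distribution reward, not the actual method.

def differential_distribution_reward(logp_chosen: float, logp_rejected: float) -> float:
    """Score a preference pair by the gap between the model's
    log-probabilities for the chosen and rejected responses."""
    return logp_chosen - logp_rejected

def filter_subset(pairs, keep_ratio=0.2):
    """Keep the top `keep_ratio` fraction of preference pairs by reward,
    yielding a small, high-quality subset for alignment fine-tuning."""
    scored = sorted(pairs,
                    key=lambda p: differential_distribution_reward(*p),
                    reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return scored[:k]

# Toy preference data: (log p(chosen), log p(rejected)) per pair.
pairs = [(-1.0, -3.0), (-2.0, -2.1), (-0.5, -4.0), (-1.5, -1.4), (-2.5, -5.0)]
subset = filter_subset(pairs, keep_ratio=0.4)
print(subset)  # the two pairs with the largest chosen-rejected gap
```

In the full framework, the filtered subset would then be fed to an existing alignment method (e.g. SFT- or contrastive-learning-based fine-tuning), with the same reward also guiding the model's output distribution during training.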

Top-level tags: llm, model training, machine learning
Detailed tags: reinforcement learning from human feedback, efficient fine-tuning, human alignment, data filtering, distributional guidance

DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment


1️⃣ One-sentence summary

This paper proposes DEFT, an efficient fine-tuning framework that filters out a small high-quality data subset and guides the model's output distribution, improving large language models' alignment with human values while reducing training cost and preserving generalization ability.

Source: arXiv 2604.01787