arXiv submission date: 2026-02-19
📄 Abstract - Entropy-Based Data Selection for Language Models

Modern language models (LMs) increasingly demand two critical resources: compute and data. Data selection techniques can substantially reduce the amount of training data required for fine-tuning LMs, but their effectiveness is closely tied to computational cost, and existing methods typically require a high compute budget. Motivated by the resource limits of practical fine-tuning scenarios, we systematically examine the relationship between data selection and uncertainty estimation over the selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, which provide new ways to alleviate data scarcity, evaluating data usability remains challenging, making efficient data selection indispensable. To address these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework, which establishes a computationally efficient data-filtering mechanism. Empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (Q&A) tasks, together with theoretical analysis, confirm its effectiveness: EUDS significantly reduces computational cost and improves training-time efficiency while requiring less data, providing a practical solution for the efficient fine-tuning of LMs in compute-constrained scenarios.

Top-level tags: llm model training data
Detailed tags: data selection entropy fine-tuning uncertainty estimation computational efficiency

Entropy-Based Data Selection for Language Models


1️⃣ One-Sentence Summary

This paper proposes an entropy-based unsupervised data selection framework that efficiently filters out high-quality training data under computational resource constraints, significantly reducing both the compute cost and the amount of data required for fine-tuning large language models.
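This summary does not spell out the selection procedure, but the core idea of entropy-based selection can be sketched as follows: score each unlabeled candidate example by the Shannon entropy of the model's predictive distribution over it, then keep the examples the model is most (or least) certain about. The function names and toy distributions below are illustrative assumptions, not the paper's actual implementation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_by_entropy(candidates, k):
    """Rank unlabeled examples by the entropy of the model's predictive
    distribution and keep the top-k most uncertain ones.
    `candidates` maps example id -> predicted class probabilities."""
    scored = sorted(candidates.items(),
                    key=lambda kv: entropy(kv[1]),
                    reverse=True)
    return [ex_id for ex_id, _ in scored[:k]]

# Toy predictive distributions (hypothetical; a real pipeline would
# obtain these from the LM's output probabilities on each example):
candidates = {
    "ex1": [0.9, 0.05, 0.05],   # confident prediction -> low entropy
    "ex2": [0.34, 0.33, 0.33],  # near-uniform -> high entropy
    "ex3": [0.6, 0.3, 0.1],
}
print(select_by_entropy(candidates, 2))  # -> ['ex2', 'ex3']
```

Because the scores need only a forward pass per example and no labels, a filter like this stays cheap relative to fine-tuning itself, which matches the compute-constrained setting the paper targets.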

Source: arXiv 2602.17465