arXiv submission date: 2026-01-29
📄 Abstract - Grounding and Enhancing Informativeness and Utility in Dataset Distillation

Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic approaches to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowledge distillation-based dataset distillation within a solid theoretical framework. We introduce the concepts of Informativeness and Utility, capturing crucial information within a sample and essential samples in the training set, respectively. Building on these principles, we define optimal dataset distillation mathematically. We then present InfoUtil, a framework that balances informativeness and utility in synthesizing the distilled dataset. InfoUtil incorporates two key components: (1) game-theoretic informativeness maximization using Shapley Value attribution to extract key information from samples, and (2) principled utility maximization by selecting globally influential samples based on Gradient Norm. These components ensure that the distilled dataset is both informative and utility-optimized. Experiments demonstrate that our method achieves a 6.1% performance improvement over the previous state-of-the-art approach on the ImageNet-1K dataset using ResNet-18.

Top-level tags: machine learning model training data
Detailed tags: dataset distillation, knowledge distillation, efficiency, information theory, synthetic data

Grounding and Enhancing Informativeness and Utility in Dataset Distillation


1️⃣ One-sentence summary

This paper proposes a theoretical framework called InfoUtil that combines game theory and gradient analysis to select, from a massive dataset, a small set of core samples that both carry key information and matter most for model training, drastically compressing the dataset while preserving high performance.

Source: arXiv 2601.21296