arXiv submission date: 2026-04-13
📄 Abstract - UniPROT: Uniform Prototype Selection via Partial Optimal Transport with Submodular Guarantees

Selecting prototypical examples from a source distribution to represent a target data distribution is a fundamental problem in machine learning. Existing subset selection methods often rely on implicit importance scores, which can be skewed towards majority classes and lead to low-quality prototypes for minority classes. We present UniPROT, a novel subset selection framework that minimizes the optimal transport (OT) distance between a uniformly weighted prototypical distribution and the target distribution. While intuitive, this formulation leads to a cardinality-constrained maximization of a *super-additive* objective, which is generally intractable to approximate efficiently. To address this, we propose a principled reformulation of the OT marginal constraints, yielding a partial optimal transport-based submodular objective. We prove that this reformulation enables a greedy algorithm with a $(1-1/e)$ approximation guarantee relative to the original super-additive maximization problem. Empirically, we showcase that enforcing uniform prototype weights in UniPROT consistently improves minority-class representation in imbalanced classification benchmarks without compromising majority-class accuracy. In both finetuning and pretraining regimes for large language models under domain imbalance, UniPROT enforces uniform source contributions, yielding robust performance gains. Our results establish UniPROT as a scalable, theoretically grounded solution for uniform-weighted prototype selection. Our code is publicly available at GitHub (Code: this https URL).
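The abstract's key algorithmic claim is that a submodular reformulation admits the standard greedy algorithm with a $(1-1/e)$ guarantee. As a minimal sketch of that greedy loop, the snippet below uses the classic facility-location objective $f(S) = \sum_j \max_{i \in S} \mathrm{sim}_{ij}$ as a stand-in for the paper's partial-OT objective (an assumption: the paper's actual objective, its similarity kernel, and the function name `greedy_prototypes` are not from the source).

```python
import numpy as np

def greedy_prototypes(sim, k):
    """Greedily pick k source prototypes (rows of `sim`) to cover target
    points (columns), maximizing a monotone submodular coverage objective.

    This is the generic (1 - 1/e) greedy scheme, illustrated with a
    facility-location surrogate, not UniPROT's partial-OT objective.
    """
    n_src, n_tgt = sim.shape
    selected = []
    best_cover = np.zeros(n_tgt)  # current max similarity per target point
    for _ in range(k):
        # Marginal gain of adding each candidate to the current set S.
        gains = np.maximum(sim, best_cover).sum(axis=1) - best_cover.sum()
        gains[selected] = -np.inf  # forbid re-selecting chosen prototypes
        i = int(np.argmax(gains))
        selected.append(i)
        best_cover = np.maximum(sim[i], best_cover)
    return selected

# Toy example: 5 candidate source points, 4 target points.
rng = np.random.default_rng(0)
sim = rng.random((5, 4))
protos = greedy_prototypes(sim, k=2)
```

Because each selected prototype contributes one equal-mass atom, the selected set induces exactly the uniform prototype weights the paper argues for; the greedy loop itself is the part the submodular reformulation makes provably near-optimal.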

Top tags: machine learning, model training, data
Detailed tags: prototype selection, optimal transport, submodular optimization, imbalanced data, subset selection

UniPROT: Uniform Prototype Selection via Partial Optimal Transport with Submodular Guarantees


1️⃣ One-sentence summary

This paper proposes a new method, UniPROT, that uses mathematical optimization to guarantee that the representative samples (prototypes) selected from a dataset carry uniform weights, effectively improving the representation of minority classes under data imbalance while preserving overall performance, and backs this with both theoretical guarantees and empirical validation.

Source: arXiv:2604.10952