Focused PU learning from imbalanced data
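1️⃣ One-sentence summary
This paper addresses a problem common in practice: few labeled positives, abundant unlabeled data, and severe class imbalance. It proposes a new "focused" positive and unlabeled (PU) learning method whose purpose-built empirical risk estimator improves classification when positives and negatives are highly imbalanced and hard to tell apart, and it validates the method on real-world tasks such as financial misstatement detection.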
We propose a new method for learning from positive and unlabeled (PU) examples in highly imbalanced datasets. Many real-world problems, such as disease gene identification, targeted marketing, fraud detection, and recommender systems, are hard to address with machine learning methods because labeled data are limited. Often, the training data comprise positive and unlabeled instances, where the unlabeled set is dominated by negatives but also contains some positives. While PU learning is well studied, few methods address imbalanced settings or hard-to-detect positive examples that resemble negative ones. Our approach uses a focused empirical risk estimator that incorporates both positive and unlabeled examples to train binary classifiers. Empirical evaluations demonstrate state-of-the-art performance on imbalanced datasets under two labeling mechanisms: selected completely at random (SCAR) and selected at random (SAR). Beyond these controlled experiments, we demonstrate the value of the proposed method in the real-world application of financial misstatement detection.
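The abstract does not give the form of the focused empirical risk estimator, so the sketch below is background only, not the paper's method. It shows how a risk can be estimated from positive and unlabeled examples alone, using the widely used non-negative PU risk estimator. The function names, the sigmoid loss, and the assumption that the class prior is known are illustrative choices, not taken from the source.

```python
import numpy as np


def sigmoid_loss(scores, target):
    """Sigmoid loss l(z, t) = 1 / (1 + exp(t * z)) for a target t in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(target * np.asarray(scores, dtype=float)))


def nn_pu_risk(scores_pos, scores_unl, prior):
    """Non-negative PU risk estimate (background sketch, NOT the focused estimator).

    scores_pos: classifier scores g(x) on labeled positive examples
    scores_unl: classifier scores g(x) on unlabeled examples
    prior:      assumed class prior P(y = +1), treated here as known
    """
    # Cost of scoring known positives as negative, weighted by the prior.
    risk_pos = prior * np.mean(sigmoid_loss(scores_pos, +1))
    # Negative-class risk: unlabeled data treated as negative, minus the
    # share of that loss attributable to positives hidden in the unlabeled set.
    risk_neg = np.mean(sigmoid_loss(scores_unl, -1)) - prior * np.mean(
        sigmoid_loss(scores_pos, -1)
    )
    # Clamp at zero so the estimated negative-class risk cannot go negative.
    return float(risk_pos + max(0.0, risk_neg))
```

With a small prior, as in imbalanced data, the positive term carries little weight in this estimator, which gives some intuition for why imbalanced PU settings may call for a differently designed risk.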
Source: arXiv: 2605.14467