arXiv submission date: 2026-04-21
📄 Abstract - Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes

The discretization of continuous numerical attributes remains a persistent computational bottleneck in the induction of decision trees, particularly as dataset dimensions scale. Building upon the recently proposed MSD-Splitting technique -- which bins continuous data using the empirical mean and standard deviation to dramatically improve the efficiency and accuracy of the C4.5 algorithm -- we introduce Adaptive MSD-Splitting (AMSD). While standard MSD-Splitting is highly effective for approximately symmetric distributions, its rigid adherence to fixed one-standard-deviation cutoffs can lead to catastrophic information loss in highly skewed data, a common artifact in real-world biomedical and financial datasets. AMSD addresses this by dynamically adjusting the standard deviation multiplier based on feature skewness, narrowing intervals in dense regions to preserve discriminative resolution. Furthermore, we integrate AMSD into ensemble methods, specifically presenting the Random Forest-AMSD (RF-AMSD) framework. Empirical evaluations on the Census Income, Heart Disease, Breast Cancer, and Forest Covertype datasets demonstrate that AMSD yields a 2-4% accuracy improvement over standard MSD-Splitting, while maintaining near-identical O(N) time complexity reductions compared to the O(N log N) exhaustive search. Our Random Forest extension achieves state-of-the-art accuracy at a fraction of standard computational costs, confirming the viability of adaptive statistical binning in large-scale ensemble learning architectures.
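As a rough illustration of the idea in the abstract, the sketch below bins a feature at mean ± k·(standard deviation) and shrinks k as the absolute skewness grows. The abstract does not give the actual adjustment rule, so the formula `k = base_k / (1 + alpha * |skewness|)`, the parameter names `base_k` and `alpha`, and the function names are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def skewness(x):
    """Population (Fisher-Pearson) skewness; returns 0.0 for constant data."""
    x = np.asarray(x, dtype=float)
    sigma = x.std()
    if sigma == 0:
        return 0.0
    return float(np.mean(((x - x.mean()) / sigma) ** 3))

def amsd_cut_points(x, base_k=1.0, alpha=0.5):
    """Return (low, high) cut points at mean -/+ k * std.

    ASSUMPTION: k shrinks with |skewness| via a hypothetical rule;
    with zero skewness this reduces to plain one-standard-deviation cuts.
    """
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    k = base_k / (1.0 + alpha * abs(skewness(x)))
    return mu - k * sigma, mu + k * sigma

def discretize(x, cuts):
    """Map each value to bin 0 (below low), 1 (middle), or 2 (at or above high)."""
    return np.digitize(np.asarray(x, dtype=float), list(cuts))

symmetric = [1, 2, 3, 4, 5]
skewed = [1, 1, 1, 1, 10]
print(amsd_cut_points(symmetric))
print(amsd_cut_points(skewed))
print(discretize(skewed, amsd_cut_points(skewed)))
```

Both the statistics and the binning take a single pass over the data, which matches the linear-time behavior the abstract describes; a tree learner would then evaluate only these two candidate thresholds per feature instead of every sorted midpoint.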

Top-level tags: machine learning systems
Detailed tags: decision trees, discretization, skewed data, random forest, efficiency

Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes


1️⃣ One-sentence summary

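This paper proposes Adaptive MSD-Splitting (AMSD), an improved method that dynamically adjusts split intervals according to how skewed the data is, raising the classification accuracy of decision trees and random forests on skewed continuous data (such as medical and financial data) while keeping computation fast.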

Source: arXiv: 2604.19722