EMA-FS: Accelerating GBDT Training via Gain-Informed Feature Screening
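1️⃣ One-Sentence Summary
This paper proposes EMA-FS, a method that tracks each feature's historical contribution (split gain) during training and builds histograms only for the most useful features. It speeds up training of gradient boosted decision tree (GBDT) models such as LightGBM by 1.3x to 2.6x without significantly degrading predictive performance, and is best suited to dense, moderate-to-high-dimensional data.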
Gradient Boosted Decision Trees (GBDT), exemplified by LightGBM, spend a dominant fraction of training time -- typically 65-70% -- constructing per-feature histograms. Existing approaches such as random feature subsampling (feature_fraction) discard features without regard for their predictive utility. We propose EMA-based Feature Screening (EMA-FS), an algorithm-level optimization that maintains an exponential moving average (EMA) of per-feature split gains across boosting iterations and, after a short warmup, restricts histogram construction to the top-K features ranked by historical gain. Unlike random subsampling, EMA-FS is informed: it retains high-gain features while screening out low-gain ones. Operating at the per-tree level, it preserves full compatibility with LightGBM's histogram subtraction trick, requiring no changes to core routines. We evaluate EMA-FS on datasets spanning financial fraud detection, advertising click-through prediction, industrial quality control, and synthetic benchmarks, with feature dimensionalities from 29 to 968. On dense, moderate-to-high-dimensional data it achieves significant speedups: 2.61x on a 500-feature synthetic benchmark and 1.45x on the 432-feature IEEE-CIS Fraud dataset at 30% retention. At 70% retention it improves AUC by 0.11 points while delivering a 1.34x speedup. On extremely sparse data (Bosch, >90% missing) it yields no speedup, as LightGBM's sparse bin optimization already bypasses empty values. We further introduce Stochastic EMA-FS (S-EMA-FS), which replaces deterministic top-K selection with gain-weighted random sampling controlled by a concentration parameter beta, unifying deterministic EMA-FS (beta -> infinity) and random subsampling (beta = 0) in one framework. Both are implemented in ~120 lines of C++ across all six LightGBM tree learners and are fully backward-compatible.
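The screening logic described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation. The function names, the smoothing factor `alpha`, the `warmup` handling, and the rounding rule for K are all hypothetical choices made for illustration. The abstract also does not specify how beta enters the sampling weights, so the sketch assumes weights proportional to `(ema + eps) ** beta`, drawn without replacement via the Gumbel top-k trick. Under that assumption, beta = 0 gives uniform random subsampling and a large beta approaches deterministic top-K.

```python
import math
import random


def update_ema(ema, tree_gains, alpha=0.1):
    """Blend one tree's per-feature split gains into the running EMA.

    alpha is an assumed smoothing factor; tree_gains[j] is the total split
    gain feature j produced in the tree just built (0.0 if it was unused).
    """
    return [(1.0 - alpha) * e + alpha * g for e, g in zip(ema, tree_gains)]


def num_kept(num_features, retention):
    """K, the number of features retained (at least one)."""
    return max(1, min(num_features, round(retention * num_features)))


def select_top_k(ema, retention):
    """Deterministic screening: keep the K features with the highest EMA gain."""
    k = num_kept(len(ema), retention)
    ranked = sorted(range(len(ema)), key=lambda j: ema[j], reverse=True)
    return sorted(ranked[:k])


def select_stochastic(ema, retention, beta, rng, eps=1e-12):
    """Gain-weighted sampling without replacement (assumed weighting).

    Adding Gumbel noise to beta * log(weight) and taking the K largest keys
    samples K distinct features with weights (ema + eps) ** beta.
    """
    k = num_kept(len(ema), retention)
    keys = []
    for e in ema:
        u = max(rng.random(), 1e-300)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        keys.append(beta * math.log(e + eps) + gumbel)
    ranked = sorted(range(len(ema)), key=lambda j: keys[j], reverse=True)
    return sorted(ranked[:k])


def features_for_tree(iteration, ema, warmup, retention):
    """Use every feature during warmup, then screen once per tree."""
    if iteration < warmup:
        return list(range(len(ema)))
    return select_top_k(ema, retention)
```

Because the feature set in this sketch is chosen once per tree rather than per node, every node in a tree shares the same histogram layout, which is consistent with the abstract's point about histogram subtraction remaining usable.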
Source: arXiv: 2606.26337