arXiv submission date: 2026-06-11
📄 Abstract - Disparate Impact in Synthetic Data Generation

We revisit the fairness notion of disparate impact for synthetic data generation (SDG), which assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, which addresses the problem of correcting for undue biases in the observed distribution and thereby redefines SDG as learning a distribution that is not that of the real data. By contrast, non-disparate impact is notably achieved when the synthetic and real distributions are the same. We expose reasons why SDG may fail to reach that solution and discuss why approximation and estimation errors occur and can be disparate across groups. In particular, we look into the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both the overall utility and its parity in many settings.
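
To make the notion of group-wise utility parity concrete, here is a minimal Python sketch. It is not the paper's metric or code: it assumes, purely for illustration, that "utility" is measured as the total variation distance between the real and synthetic marginals of one categorical feature within each sensitive group. The function names, the record layout (a list of dicts), and the toy data are all hypothetical.

```python
from collections import Counter


def empirical(values):
    """Empirical distribution of a list of categorical values."""
    n = len(values)
    if n == 0:
        return {}  # empty sample: no mass assigned to any category
    return {k: c / n for k, c in Counter(values).items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def groupwise_utility_gap(real, synth, group_key, feature_key):
    """Per-group TV error and the spread between worst and best group.

    Assumption: a spread of zero is read as parity of utility across
    groups; this stand-in is ours, not a definition from the paper.
    """
    groups = {r[group_key] for r in real}
    errors = {}
    for g in groups:
        real_g = [r[feature_key] for r in real if r[group_key] == g]
        synth_g = [s[feature_key] for s in synth if s[group_key] == g]
        errors[g] = total_variation(empirical(real_g), empirical(synth_g))
    return errors, max(errors.values()) - min(errors.values())


# Toy data (hypothetical): group "a" is reproduced exactly, group "b" is not.
real = [{"g": "a", "x": 0}, {"g": "a", "x": 1},
        {"g": "b", "x": 0}, {"g": "b", "x": 1}]
synth = [{"g": "a", "x": 0}, {"g": "a", "x": 1},
         {"g": "b", "x": 0}, {"g": "b", "x": 0}]

errors, gap = groupwise_utility_gap(real, synth, "g", "x")
print(errors, gap)
```

On this toy data the error is zero for group "a" and 0.5 for group "b", so the gap is 0.5. When the synthetic and real distributions coincide, every per-group error is zero and so is the gap, which mirrors the abstract's remark that matching distributions yield non-disparate impact.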

Top-level tags: machine learning data model evaluation
Detailed tags: fairness disparate impact synthetic data generation differential privacy probabilistic graphical models

Disparate Impact in Synthetic Data Generation


1️⃣ One-sentence summary

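This paper revisits fairness in synthetic data generation. It notes that disparate impact across sensitive groups is notably avoided when the synthetic data distribution matches the real one, analyzes why disparities nonetheless arise, and presents a strategy of learning group-wise models that can improve both overall utility and its parity.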

Source: arXiv: 2606.13105