arXiv submission date: 2026-05-13
📄 Abstract - Fair and Calibrated Toxicity Detection with Robust Training and Abstention

Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines the efficacy of the latter. We compare Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across these axes, combined with temperature scaling, confidence-based abstention, and per-identity threshold optimization. Evaluation uses subgroup AUC, BPSN/BNSP AUC, error gaps, and per-subgroup Expected Calibration Error (ECE) with bootstrap CIs ($n = 1000$). We report four findings. (1) Calibration disparity is a hidden fairness violation. ERM has near-perfect aggregate calibration ($0.013$) but is significantly miscalibrated across all identity subgroups ($+0.029$ to $+0.134$). (2) Training interventions reshape rather than eliminate disparity. Reweighted ERM improves ranking (BPSN AUC $+0.06$ to $+0.12$) but worsens the calibration-fairness gap by up to $+0.232$. Group DRO eliminates calibration disparity but only by becoming uniformly miscalibrated globally (ECE $0.118$). (3) Post-hoc methods inherit training failure modes. Temperature scaling fails because miscalibration is non-uniform. Confidence-based abstention works under ERM but breaks under DRO, where the risk-coverage curve rises with deferral. (4) Abstention itself is unfair. Confidence-based deferral helps background content far more than identity-mentioning content. We argue that SRAI fairness requires a multi-axis framework: methods that differ only in aggregate ranking can differ sharply in failure modes that determine real-world harm.
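The calibration finding in the abstract (good aggregate ECE hiding subgroup miscalibration) can be illustrated with a minimal sketch. It assumes binary toxicity probabilities, equal-width bins on the predicted toxic-class probability, and that the reported subgroup figures are gaps relative to the aggregate ECE; the paper's exact binning and gap definition are not given here, and the names `expected_calibration_error` and `subgroup_ece_gaps` are hypothetical.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins on the predicted toxic-class probability.

    Assumption: this binning scheme is illustrative, not necessarily the
    one used in the paper.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Map each probability to a bin index; p == 1.0 goes into the last bin.
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # |mean predicted probability - observed toxic rate| in this bin,
        # weighted by the fraction of examples that fall in the bin.
        gap = abs(probs[mask].mean() - labels[mask].mean())
        ece += mask.mean() * gap
    return ece


def subgroup_ece_gaps(probs, labels, groups, n_bins=10):
    """Per-subgroup ECE minus aggregate ECE (assumed gap definition)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    groups = np.asarray(groups)
    overall = expected_calibration_error(probs, labels, n_bins)
    gaps = {}
    for g in np.unique(groups).tolist():
        mask = groups == g
        gaps[g] = expected_calibration_error(probs[mask], labels[mask], n_bins) - overall
    return gaps
```

A positive gap for a subgroup means its predictions are less calibrated than the aggregate number suggests, which is the pattern the abstract describes for ERM.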

Top-level tags: llm fairness model evaluation
Detailed tags: toxicity detection calibration abstention fairness evaluation training interventions

Fair and Calibrated Toxicity Detection with Robust Training and Abstention


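1️⃣ One-sentence summary

This paper exposes a hidden fairness problem in current toxicity detection models: even when aggregate performance looks good, the model's prediction confidence can be badly miscalibrated across identity groups. Common training-time interventions and post-hoc remedies fail to remove this disparity and can even make the abstention mechanism itself unfair, so fairness evaluation needs a multi-axis framework that covers ranking, calibration, and abstention together.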
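To make the abstention axis concrete, here is a rough sketch of a risk-coverage computation for confidence-based deferral. It assumes binary probabilities, confidence defined as `max(p, 1 - p)`, and a fixed 0.5 decision threshold; none of these details are confirmed by the abstract, and `risk_coverage` is a hypothetical helper name.

```python
import numpy as np


def risk_coverage(probs, labels, coverages=(1.0, 0.9, 0.8), threshold=0.5):
    """Error rate on the most-confident fraction of examples.

    Assumptions (illustrative only): confidence is max(p, 1 - p) and the
    least-confident examples are the ones deferred.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    confidence = np.maximum(probs, 1.0 - probs)
    preds = (probs >= threshold).astype(int)
    errors = (preds != labels).astype(float)
    # Most confident first; the tail of this ordering is what gets deferred.
    order = np.argsort(-confidence, kind="stable")
    curve = {}
    for c in coverages:
        kept = max(1, int(round(c * len(probs))))
        curve[c] = errors[order[:kept]].mean()
    return curve
```

Running this separately on identity-mentioning and background examples (for instance by masking `probs` and `labels` with a boolean group indicator) gives one way to compare how much each group benefits from deferral. Risk that increases as coverage shrinks would correspond to the failure the abstract reports under Group DRO.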

Source: arXiv: 2605.14074