Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment
1️⃣ One-Sentence Summary
This paper finds that optimizing a large language model's fairness with respect to a single sensitive attribute (e.g., gender) can inadvertently worsen its bias on other, untargeted attributes (e.g., physical appearance, sexual orientation), especially in ambiguous contexts where information is incomplete; it therefore calls for fairness evaluation frameworks that account for multiple attributes and specific contexts.
Conventional large language model (LLM) fairness alignment largely focuses on mitigating bias along single sensitive attributes, overlooking fairness as an inherently multidimensional and context-specific value. This approach risks creating systems that achieve narrow fairness metrics while exacerbating disparities along untargeted attributes, a phenomenon known as bias spillover. While extensively studied in machine learning, bias spillover remains critically underexplored in LLM alignment. In this work, we investigate how targeted gender alignment affects fairness across nine sensitive attributes in three state-of-the-art LLMs (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B). Using Direct Preference Optimization and the BBQ benchmark, we evaluate fairness under ambiguous and disambiguated contexts. Our findings reveal noticeable bias spillover: while aggregate results show improvements, context-aware analysis exposes significant degradations in ambiguous contexts, particularly for physical appearance ($p< 0.001$ across all models), sexual orientation, and disability status. We demonstrate that improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty, highlighting the necessity of context-aware, multi-attribute fairness evaluation frameworks.
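To make the abstract's call for context-aware, multi-attribute evaluation concrete, here is a minimal sketch of how per-attribute results might be broken out by context condition on BBQ-style records instead of being aggregated. The field names (`category`, `context_condition`, `label`, `prediction`) and the toy records are assumptions for illustration; this is not the paper's evaluation code or the official BBQ scoring script.

```python
# Hypothetical sketch: per-attribute, per-context audit of BBQ-style predictions.
# Splitting by (attribute, context) is what makes spillover in ambiguous
# contexts visible even when aggregate accuracy improves.
from collections import defaultdict

def audit(records):
    """Return accuracy grouped by (sensitive attribute, context condition)."""
    stats = defaultdict(lambda: {"correct": 0, "total": 0})
    for r in records:
        key = (r["category"], r["context_condition"])  # e.g. ("Gender", "ambiguous")
        stats[key]["total"] += 1
        stats[key]["correct"] += int(r["prediction"] == r["label"])
    return {k: v["correct"] / v["total"] for k, v in stats.items() if v["total"]}

if __name__ == "__main__":
    # Toy records; in ambiguous BBQ contexts the correct answer is "unknown".
    toy = [
        {"category": "Gender", "context_condition": "ambiguous",
         "label": "unknown", "prediction": "unknown"},
        {"category": "Physical_appearance", "context_condition": "ambiguous",
         "label": "unknown", "prediction": "the overweight man"},
        {"category": "Physical_appearance", "context_condition": "disambiguated",
         "label": "the thin man", "prediction": "the thin man"},
    ]
    for (attr, ctx), acc in sorted(audit(toy).items()):
        print(f"{attr:22s} {ctx:14s} accuracy={acc:.2f}")
```

In this toy output, aggregate accuracy looks reasonable, but the ambiguous-context row for physical appearance drops to zero, which is the kind of degradation the paper reports being masked by aggregate metrics.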
Source: arXiv: 2602.16438