Abstract - DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training
Large language models (LLMs) tuned for safety often avoid acknowledging demographic differences, even when such acknowledgment is factually correct (e.g., ancestry-based disease incidence) or contextually justified (e.g., religious hiring preferences). This identity-blindness yields incorrect responses, unnecessary refusals, or generic "equal-treatment" defaults. We study this via difference-awareness classification: given a question involving demographic groups, the task is not to answer directly, but to classify whether a correct answer requires recognizing group differences (yes) or whether groups should be treated identically (no). Crucially, fine-tuning for accuracy triggers harm drift: model-generated explanations become increasingly harmful as decision accuracy improves, whether by elaborating harmful content, introducing problematic assumptions, or failing to flag harms the baseline identified. To mitigate this, we introduce DART (Distill-Audit-Repair Training), which distills label-conditioned reasoning from a teacher, audits outputs for harm drift cases relative to the baseline, and repairs problematic cases via severity-weighted fine-tuning. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8%, with the largest gains on equal-treatment prompts (11.3% → 72.6%), while reducing harm drift cases by 72.6%. It also transfers to 280 open-ended real-world queries across medical, legal, policy, and educational domains, improving difference-appropriate responses from 39.8% to 77.5% while reducing refusals from 34.3% to 3.0%. Our results demonstrate that accuracy and safety need not conflict when explicit detection and repair mechanisms are in place.
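The audit step described above compares each fine-tuned output against the corresponding baseline output. A minimal sketch of that comparison, assuming each explanation has already been assigned a scalar harm score by some judge model (the function name, the score representation, and the threshold `eps` are all illustrative assumptions, not the paper's actual design):

```python
def flag_harm_drift(baseline_harm: float, tuned_harm: float, eps: float = 0.1) -> bool:
    """Audit step (sketch): flag a case as harm drift when the fine-tuned
    model's explanation is judged more harmful than the baseline's by
    more than a tolerance eps. Both inputs are harm scores in [0, 1]
    produced by an external judge (an assumption of this sketch)."""
    return tuned_harm - baseline_harm > eps


# A case whose explanation got notably more harmful after fine-tuning
# is flagged for the repair stage; a case within tolerance is not.
flag_harm_drift(0.2, 0.5)   # flagged
flag_harm_drift(0.2, 0.25)  # within tolerance, not flagged
```

Flagged cases are the input to the repair stage; how harm scores are obtained (e.g., an LLM judge or a safety classifier) is left open here.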
DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training
1️⃣ One-Sentence Summary
This paper proposes a training framework called DART, a three-step approach that first teaches the model when to acknowledge group differences, then audits its responses for harmful content and repairs the problematic ones. This addresses the problem of models unexpectedly generating more dangerous content as their difference-recognition accuracy improves, so that the model can answer questions involving gender, race, and other group differences accurately while remaining safe and harmless.
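The repair step uses severity-weighted fine-tuning: cases the audit flagged as more severely drifted contribute more to the training objective. A minimal sketch of one way such weighting could work (the weight formula, its constants, and the discrete severity levels are illustrative assumptions, not the paper's actual scheme):

```python
def severity_weight(severity: int, base: float = 1.0, scale: float = 0.5) -> float:
    """Map an audit severity level (0 = no drift; higher = worse) to a
    per-example training weight. The linear form is an assumption of
    this sketch; any monotone mapping would serve the same purpose."""
    return base + scale * severity


def severity_weighted_loss(losses: list[float], severities: list[int]) -> float:
    """Weighted mean of per-example losses, so that repaired examples
    fixing more severe harm drift pull harder on the fine-tuning
    objective than unflagged examples."""
    weights = [severity_weight(s) for s in severities]
    total = sum(w * l for w, l in zip(weights, losses))
    return total / sum(weights)


# Two examples with per-token losses 0.5 and 1.0; the second was
# audited at severity 2, so it dominates the weighted average.
severity_weighted_loss([0.5, 1.0], [0, 2])
```

In a real training loop the per-example losses would come from the model's cross-entropy on repaired targets; the sketch only shows how the audit's severity signal enters the objective.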