Beyond Creed: A Non-Identity Safety Condition A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning

📄 Abstract - Beyond Creed: A Non-Identity Safety Condition A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning

How safety supervision is written may matter more than the explicit identity content it contains. We study low-data LoRA safety fine-tuning with four supervision formats built from the same core safety rules: constitutional rules (A), creed-style identity framing (B), a B-matched creed condition with a worldview/confession identity-maintenance tail (C), and a matched non-identity condition (D). Across three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate HarmBench using a reconciled dual-judge pipeline combining Bedrock-hosted DeepSeek v3.2 and Sonnet 4.6, with disagreement and boundary cases manually resolved. The non-identity condition D is the strongest group on all three model families on the full 320-behavior HarmBench set, reaching 74.4% refusal on Llama, 76.9% on Gemma, and 74.1% on Qwen. By comparison, creed-style framing (B) improves over plain constitutional rules (A) on Llama and Gemma, but remains substantially below D, yielding an overall descriptive ordering of $D > B > C \geq A > baseline$. This provides a bounded empirical challenge to a strong version of the identity-framing hypothesis: explicit creed-style identity language is not necessary for the strongest gains observed here. Capability evaluations on MMLU and ARC-Challenge show no meaningful trade-off across conditions.

超越信条：一种非身份安全条件——低数据LoRA微调中身份框架的强有力实证替代方案 / Beyond Creed: A Non-Identity Safety Condition A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning

1️⃣ 一句话总结

这篇论文通过实验发现，在提升AI模型安全性的微调过程中，使用不强调AI身份的普通规则说明，比刻意强调AI身份（如“我是一个无害的AI”）的“信条式”框架效果更好，且不影响模型的其他能力。

← 返回列表

菜单

AI 帮我研读全文

1️⃣ 一句话总结

密码管理

设置密码

修改密码

移除密码

菜单

AI 帮我研读全文

1️⃣ 一句话总结

获取最新论文摘要