$C$-$ΔΘ$: Circuit-Restricted Weight Arithmetic for Selective Refusal

📄 Abstract

Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and its cost scales with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose $C$-$\Delta\theta$: Circuit-Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update $\Delta\theta_C$ supported only on that circuit (typically <5% of parameters). Applying $\Delta\theta_C$ yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
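
To make the idea of a circuit-restricted update concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a binary circuit mask is already available (the abstract obtains the circuit with EAP-IG, which is not reproduced here) and that some candidate weight change `delta` has already been computed; the abstract does not say how. The parameter names, the `scale` factor, and all helper functions are hypothetical.

```python
from typing import Dict, List

# Hypothetical toy representation: parameter name -> flattened weights.
Params = Dict[str, List[float]]
Mask = Dict[str, List[bool]]


def restrict_to_circuit(delta: Params, mask: Mask) -> Params:
    """Zero every entry of `delta` that lies outside the circuit mask."""
    return {
        name: [d if keep else 0.0 for d, keep in zip(values, mask[name])]
        for name, values in delta.items()
    }


def apply_update(theta: Params, delta_c: Params, scale: float = 1.0) -> Params:
    """Return edited weights theta + scale * delta_c (the offline edit)."""
    return {
        name: [w + scale * d for w, d in zip(values, delta_c[name])]
        for name, values in theta.items()
    }


def support_fraction(mask: Mask) -> float:
    """Fraction of all parameters that the circuit mask allows to change."""
    total = sum(len(m) for m in mask.values())
    kept = sum(sum(m) for m in mask.values())
    return kept / total


# Toy weights; a real model would have billions of entries.
theta = {"attn.W_O": [0.5, -0.25, 1.0, 0.0], "mlp.W_in": [0.1, 0.2, 0.3, 0.4]}
# Assumed candidate change; how it is derived is not specified in the abstract.
delta = {"attn.W_O": [0.25, 0.5, -0.5, 1.0], "mlp.W_in": [1.0, 1.0, 1.0, 1.0]}
# Assumed circuit mask: only one weight belongs to the refusal circuit.
mask = {"attn.W_O": [True, False, False, False], "mlp.W_in": [False] * 4}

delta_c = restrict_to_circuit(delta, mask)
edited = apply_update(theta, delta_c)
```

In this toy setup only the single masked-in weight changes, and the result `edited` is an ordinary set of weights that could be saved as a checkpoint and served without any runtime hook, which is the deployment property the abstract emphasizes.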

1️⃣ One-sentence summary

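This paper proposes a new method, C-ΔΘ, that edits a very small, specific circuit inside a large language model offline (typically fewer than 5% of its parameters) so that the model learns to refuse safely in specific situations. Because no extra computation is needed at each use, deployment cost and complexity are reduced.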
