$C$-$ΔΘ$: Circuit-Restricted Weight Arithmetic for Selective Refusal
1️⃣ One-Sentence Summary
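This paper proposes C-ΔΘ, a method that edits a very small, specific circuit inside a large language model offline (typically under 5% of the parameters) so that the model learns to refuse safely in targeted cases, with no extra computational intervention at each use, which lowers deployment cost and complexity.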
Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose $C$-$\Delta\theta$: Circuit Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update $\Delta\theta_C$ supported only on that circuit (typically <5% of parameters). Applying $\Delta\theta_C$ yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
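The idea of an update "supported only on that circuit" can be illustrated with a toy sketch. Everything below is an assumption for illustration rather than the paper's procedure: the parameters are a flat list of floats, the per-parameter `scores` are a hypothetical stand-in for circuit attributions (EAP-IG itself scores edges in the computation graph), and the raw update is taken to be the difference between a base checkpoint and a hypothetical refusal-tuned one. The helper names `top_fraction_mask` and `circuit_restricted_update` are invented here.

```python
def top_fraction_mask(scores, fraction):
    """Return a 0/1 mask keeping the `fraction` of entries with the largest |score|.

    `scores` is a hypothetical per-parameter attribution; it stands in for
    whatever circuit-localization method selects the circuit.
    """
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def circuit_restricted_update(theta_base, theta_tuned, mask, scale=1.0):
    """Add (theta_tuned - theta_base) to theta_base only where mask == 1.

    Assumption: the unconstrained update is a plain checkpoint difference.
    Parameters outside the mask are returned unchanged.
    """
    return [b + scale * m * (t - b) for b, t, m in zip(theta_base, theta_tuned, mask)]


# Toy example: 20 parameters, keep the top 5% (one parameter) by score.
theta_base = [0.0] * 20
theta_tuned = [1.0] * 20                  # hypothetical refusal-tuned weights
scores = [float(i) for i in range(20)]    # hypothetical attribution scores
mask = top_fraction_mask(scores, 0.05)
theta_edited = circuit_restricted_update(theta_base, theta_tuned, mask)
```

In this sketch the edited parameters are an ordinary list of weights, which mirrors the abstract's point that the result deploys as a standard checkpoint: the mask is used once offline and nothing needs to run at inference time.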
Source: arXiv: 2602.04521