Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation
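1️⃣ One-sentence summary
This paper evaluates four tools for removing safety refusal behavior from large language models across a range of models, finds that mathematical reasoning is the capability most affected by these tools, and gives researchers a basis for choosing a suitable tool.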
Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures.
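To make "directional orthogonalization" and the KL divergence metric concrete, the following is a minimal NumPy sketch. It is not taken from any of the four tools: the function names, the (d_out, d_in) weight layout, and the random "refusal direction" are all assumptions made for illustration. The idea is that a weight matrix is projected so its output has no component along a chosen direction, and that distribution shift between two output distributions can be measured with KL divergence.

```python
import numpy as np


def orthogonalize(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project `direction` out of the output space of `weight`.

    Assumed layout: weight is (d_out, d_in), direction is (d_out,).
    Computes W' = W - r r^T W with r the unit-normalized direction.
    """
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r @ weight)


def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


# Hypothetical toy data: a random layer and a random "refusal direction".
rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 4))
direction = rng.normal(size=8)

ablated = orthogonalize(weight, direction)

# After ablation, no input can produce output along `direction`.
x = rng.normal(size=4)
residual = direction @ (ablated @ x)
print(np.isclose(residual, 0.0))
```

In a real model the direction would typically be estimated from activations rather than sampled at random, and the KL divergence would be computed between the next-token distributions of the original and modified models; how each evaluated tool does this is not specified in the abstract.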
Source: arXiv:2512.13655