arXiv submission date: 2026-05-27
📄 Abstract - Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation

This position paper argues that the AI/ML community should stop overclaiming and retire the label "positive backdoor," and instead treat trigger-activated hidden behaviors as Secret Alignment. Crucially, protective claims based on Secret Alignment should be presumed not secure by default unless supported by rigorous, standardized evaluation. The Private AI era, enabled by open-weight LLMs and accessible training/inference stacks, turns language models into privately owned digital assets, creating security concerns around unauthorized access, model theft, and behavioral misuse. Recently, a line of work framed as "positive backdoors" has been proposed to address these challenges. To ground our position in evidence, we unify these proposals as covert trigger-behavior associations for access gating, ownership attribution, and safety enforcement, and evaluate three representative applications across six core properties: effectiveness, harmlessness, persistence, efficiency, robustness, and reliability. Our results reveal substantial brittleness in trigger-behavior mappings, especially in their confidentiality, integrity, and availability (CIA), that existing claims often underrepresent. We further relate these outcomes to behavior density and decision complexity, offering a behavioral lens for understanding deployment-time risks and motivating community-wide evaluation that makes Secret Alignment claims provable.

Top-level tags: llm model evaluation security
Detailed tags: backdoor attacks alignment trigger behaviors systematic evaluation proprietary models

Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation


1️⃣ One-sentence summary

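This paper argues that the AI community should stop describing hidden model behaviors activated by specific triggers as "positive backdoors" and should instead call them "Secret Alignment," and it stresses that such protective measures should not be presumed secure unless they have been validated through rigorous, standardized evaluation. Through experiments, the authors show that these methods are brittle in confidentiality, integrity, and availability, and they call on the community to establish evaluation standards that make security claims provable.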

Source: arXiv: 2605.28597