arXiv submission date: 2026-03-30
📄 Abstract - Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving over 99% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.
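As a rough illustration of the training recipe the abstract describes, the sketch below shows the group-relative advantage computation that defines GRPO, paired with a hypothetical hybrid reward that mixes classifier evasion with answer fidelity. The reward terms, weights, and function names are assumptions for illustration only, not the paper's actual objective.

```python
# Minimal sketch (not the authors' code) of GRPO's group-relative advantage
# with a hypothetical hybrid reward. All weights and signals are assumed.
import numpy as np

def hybrid_reward(evasion_score: float, fidelity_score: float,
                  w_evade: float = 0.5, w_fidelity: float = 0.5) -> float:
    """Hypothetical scalar reward: evasion_score is 1.0 when an LLM-based
    classifier flags nothing; fidelity_score measures whether the response
    still answers the underlying query. Weights are illustrative."""
    return w_evade * evasion_score + w_fidelity * fidelity_score

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO replaces a learned value baseline with group statistics over
    G completions sampled for the same prompt: A_i = (r_i - mean) / std."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 4 sampled completions for one prompt, scored by the two signals.
scores = [(1.0, 0.9), (0.0, 1.0), (1.0, 0.4), (1.0, 0.8)]
rewards = np.array([hybrid_reward(e, f) for e, f in scores])
print(grpo_advantages(rewards))  # completions that both evade and stay useful get positive advantage
```

Because the advantage is normalized within each sampled group, completions that satisfy only one of the two signals (evading the classifier but degrading the answer, or vice versa) are pushed down relative to those that satisfy both, which is one plausible mechanism for avoiding the capability loss ("jailbreak tax") the abstract highlights.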

Top-level tags: llm, model, training, systems
Detailed tags: adversarial fine-tuning, safety bypass, constitutional classifiers, curriculum learning, reinforcement learning

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning


1️⃣ One-sentence summary

This paper proposes an adversarial fine-tuning method called "Trojan-Speak" that teaches a large language model a covert communication protocol, allowing it to effectively bypass AI safety classifiers while leaving the model's normal reasoning ability almost untouched.
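The abstract's closing defense claim, that activation-level probes improve robustness, can be pictured with a toy sketch: instead of judging the surface text, a lightweight classifier is trained on the model's hidden activations, which a covert protocol may fail to disguise. The sketch below uses synthetic vectors as a stand-in for real activations; the layer choice, pooling, and data are assumptions, not the paper's setup.

```python
# Toy sketch (assumption, not the paper's implementation) of an
# activation-level probe: a logistic classifier over hidden-state vectors
# that can flag harmful content even when the surface text evades an
# LLM-based classifier. Synthetic Gaussians stand in for real activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 128                        # hidden size of the monitored layer
# Stand-in activations: mean-pooled residual-stream vectors per response.
benign  = rng.normal(0.0, 1.0, size=(500, d_model))
harmful = rng.normal(0.5, 1.0, size=(500, d_model))  # shifted distribution
X = np.vstack([benign, harmful])
y = np.array([0] * 500 + [1] * 500)

probe = LogisticRegression(max_iter=1000).fit(X, y)
# At inference time, the probe scores each response's pooled activations;
# scores above a threshold would trigger a refusal or human review.
print("train accuracy:", probe.score(X, y))
```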

Source: arXiv:2603.29038