arXiv submission date: 2026-03-02
📄 Abstract - Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution

Adversarial behavior plays a central role in aligning large language models with human values. However, existing alignment methods largely rely on static adversarial settings, which fundamentally limit robustness, particularly in multimodal settings with a larger attack surface. In this work, we move beyond static adversarial supervision and introduce co-evolutionary alignment with evolving attacks, instantiated by CEMMA (Co-Evolutionary Multi-Modal Alignment), an automated and adaptive framework for multimodal safety alignment. We introduce an Evolutionary Attacker that decomposes adversarial prompts into method templates and harmful intents. By employing genetic operators, including mutation, crossover, and differential evolution, it enables simple seed attacks to inherit the structural efficacy of sophisticated jailbreaks. The Adaptive Defender is iteratively updated on the synthesized hard negatives, forming a closed-loop process that adapts alignment to evolving attacks. Experiments show that the Evolutionary Attacker substantially increases red-teaming jailbreak attack success rate (ASR), while the Adaptive Defender improves robustness and generalization across benchmarks with higher data efficiency, without inducing excessive benign refusal, and remains compatible with inference-time defenses such as AdaShield.
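The three genetic operators named in the abstract (mutation, crossover, and differential evolution) can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the paper's implementation: candidates are plain numeric vectors rather than prompt templates, the fitness function is an arbitrary stand-in score, and the names `mutate`, `crossover`, `differential`, and `evolve` are hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]


def mutate(x: Vector, rng: random.Random, scale: float = 0.1) -> Vector:
    """Gaussian mutation: perturb one randomly chosen coordinate."""
    child = list(x)
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0.0, scale)
    return child


def crossover(a: Vector, b: Vector, rng: random.Random) -> Vector:
    """One-point crossover: prefix from a, suffix from b (needs len >= 2)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def differential(a: Vector, b: Vector, c: Vector, f: float = 0.5) -> Vector:
    """Differential-evolution mutant: a + f * (b - c)."""
    return [ai + f * (bi - ci) for ai, bi, ci in zip(a, b, c)]


def evolve(
    population: List[Vector],
    fitness: Callable[[Vector], float],
    rng: random.Random,
    generations: int = 20,
) -> List[Vector]:
    """Elitist loop (assumed selection scheme): breed children with all
    three operators, then keep the fittest candidates, parents included."""
    size = len(population)  # must be at least 3 for rng.sample below
    for _ in range(generations):
        children = []
        for _ in range(size):
            a, b, c = rng.sample(population, 3)
            children.append(mutate(a, rng))
            children.append(crossover(a, b, rng))
            children.append(differential(a, b, c))
        ranked = sorted(population + children, key=fitness, reverse=True)
        population = ranked[:size]
    return population


if __name__ == "__main__":
    rng = random.Random(0)
    target = [1.0, -1.0, 0.5]  # arbitrary optimum for the toy fitness

    def toy_fitness(x: Vector) -> float:
        # Stand-in score: higher is better, peaks at the target vector.
        return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))

    start = [[rng.uniform(-2, 2) for _ in range(3)] for _ in range(6)]
    final = evolve(start, toy_fitness, rng)
    print("best toy fitness:", toy_fitness(final[0]))
```

Because parents compete with their children for survival, the best score in the population never decreases between generations. In the closed loop the abstract describes, the score would presumably be measured against the current defender, which is then updated on the surviving hard negatives; that part is not modeled here.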

Top-level tags: multi-modal, model training, model evaluation
Detailed tags: adversarial alignment, co-evolutionary learning, multimodal safety, jailbreak robustness, genetic algorithms

Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution


1️⃣ One-sentence summary

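This paper proposes CEMMA, an automated and adaptive framework in which an attacker (which keeps evolving to generate malicious prompts that are harder to defend against) and a defender (which keeps learning from these new attacks to strengthen model safety) compete and co-evolve, more effectively improving the robustness and generalization of multimodal AI models' alignment with human values.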

Source: arXiv: 2603.01784