arXiv submission date: 2026-03-10
📄 Abstract - Multi-Stream Perturbation Attack: Breaking Safety Alignment of Thinking LLMs Through Concurrent Task Interference

The widespread adoption of thinking mode in large language models (LLMs) has significantly enhanced complex task processing capabilities while introducing new security risks. When subjected to jailbreak attacks, the step-by-step reasoning process may cause models to generate more detailed harmful content. We observe that thinking mode exhibits unique vulnerabilities when processing interleaved multiple tasks. Based on this observation, we propose the multi-stream perturbation attack, which generates superimposed interference by interweaving multiple task streams within a single prompt. We design three perturbation strategies: multi-stream interleaving, inversion perturbation, and shape transformation, which disrupt the thinking process through concurrent task interleaving, character reversal, and format constraints, respectively. On the JailbreakBench, AdvBench, and HarmBench datasets, our method achieves attack success rates exceeding those of most existing methods across mainstream models including the Qwen3 series, DeepSeek, Qwen3-Max, and Gemini 2.5 Flash. Experiments show thinking collapse rates and response repetition rates reach up to 17% and 60% respectively, indicating that multi-stream perturbation not only bypasses safety mechanisms but also causes thinking-process collapse or repetitive outputs.
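The paper does not include code in this summary, but the three perturbation strategies operate at the level of plain string transformations. A minimal illustrative sketch (function names and the word-by-word interleaving granularity are assumptions, applied here to harmless placeholder tasks):

```python
# Illustrative sketch of the three perturbation strategies named in the
# abstract; function names and details are assumptions, not the paper's code.

def interleave_streams(tasks, sep=" | "):
    """Multi-stream interleaving: merge several task strings word by word."""
    streams = [t.split() for t in tasks]
    # zip truncates to the shortest stream; kept simple for illustration
    return " ".join(sep.join(chunk) for chunk in zip(*streams))

def invert(task):
    """Inversion perturbation: reverse the character order of a task."""
    return task[::-1]

def shape_transform(task, width=8):
    """Shape transformation: impose a rigid fixed-width column layout."""
    return "\n".join(task[i:i + width] for i in range(0, len(task), width))

tasks = ["summarize the article", "translate the poem now"]
print(interleave_streams(tasks))   # words from both tasks alternate
print(invert(tasks[0]))            # characters reversed
print(shape_transform(tasks[0]))   # text broken into fixed-width rows
```

The point of the sketch is only the mechanics: each transformation is trivial on its own, and the attack's effect comes from superimposing them inside a single prompt so the model's step-by-step reasoning must untangle several streams at once.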

Top-level tags: llm, model evaluation, agents
Detailed tags: jailbreak attack, safety alignment, adversarial attack, reasoning vulnerabilities, multi-task interference

Multi-Stream Perturbation Attack: Breaking Safety Alignment of Thinking LLMs Through Concurrent Task Interference


1️⃣ One-sentence summary

This paper finds that making a large language model handle several interleaved tasks at once (for example, mixing different questions together in one prompt) can disrupt its step-by-step reasoning process, successfully bypassing safety guardrails so that it produces harmful content or suffers thinking collapse.

Source: arXiv:2603.10091