Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models
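1️⃣ One-Sentence Summary
This paper proposes a runtime monitor that needs no fine-tuning or model modification: it analyzes the trajectory of hidden states across a large model's layers (much like a fingerprint) to detect backdoor attacks, jailbreak prompts, and prompt injections at the same time, substantially lowering attack success rates on several mainstream LLMs at very low computational cost.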
Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights, assumptions that rarely hold for opaque third-party artifacts. We introduce Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitor that treats the inter-layer hidden-state trajectory as a health signal: LCF computes a diagonal Mahalanobis distance on every inter-layer difference, aggregates via Ledoit-Wolf shrinkage, and thresholds via leave-one-out calibration on 200 clean examples, with no reference model, trigger knowledge, or retraining. Evaluated on four architectures (Llama-3-8B, Qwen2.5-7B, Gemma-2-9B, Qwen2.5-14B) across backdoors, jailbreaks, and prompt injection (56 backdoor combinations, 3 jailbreak techniques, and BIPIA email + code-QA), LCF reduces mean backdoor attack success rate (ASR) below 1% on Qwen2.5-7B and Gemma-2 and to 1.3% on Qwen2.5-14B, detects 92-100% of DAN jailbreaks (62-100% for GCG and softer role-play), and flags 100% of text-payload injections across all eight (model, domain) cells, at 12-16% backdoor FPR and <0.1% inference overhead. A single aggregation score covers all three threat families without threat-specific tuning, positioning LCF as a general-purpose runtime safety layer for cloud-served and on-device LLMs.
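The abstract names three steps: a diagonal Mahalanobis distance on each inter-layer difference, a shrinkage-based aggregation, and a leave-one-out threshold calibrated on clean examples. The NumPy sketch below is one possible reading of that pipeline, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the array layout `(examples, layers, hidden_dim)`, the 0.95 calibration quantile, and the fixed shrinkage weight `alpha`, which stands in for the data-driven Ledoit-Wolf intensity (available, for instance, through `sklearn.covariance.LedoitWolf`). The hidden states are synthetic random walks rather than real model activations.

```python
import numpy as np

def hidden_to_deltas(hidden):
    """(n, L+1, D) hidden states -> (n, L, D) inter-layer differences."""
    return np.diff(hidden, axis=1)

def layer_distances(deltas, mu, var):
    """Diagonal Mahalanobis distance per layer: (n, L, D) -> (n, L)."""
    return np.sqrt((((deltas - mu) ** 2) / var).sum(axis=-1))

def fit_stats(deltas, alpha=0.1, eps=1e-6):
    """Fit clean statistics. alpha is a fixed, hypothetical shrinkage weight
    (Ledoit-Wolf would estimate it from the data instead)."""
    mu = deltas.mean(axis=0)                      # (L, D) per-layer mean
    var = deltas.var(axis=0) + eps                # (L, D) per-layer variance
    dist = layer_distances(deltas, mu, var)       # (n, L)
    center = dist.mean(axis=0)
    cov = np.atleast_2d(np.cov(dist, rowvar=False))
    n_layers = cov.shape[0]
    target = np.trace(cov) / n_layers * np.eye(n_layers)
    shrunk = (1.0 - alpha) * cov + alpha * target  # shrink toward scaled identity
    return mu, var, center, np.linalg.inv(shrunk)

def score(deltas, stats):
    """One aggregate anomaly score per example."""
    mu, var, center, precision = stats
    z = layer_distances(deltas, mu, var) - center
    return np.sqrt(np.einsum("nl,lk,nk->n", z, precision, z))

def loo_threshold(clean_deltas, quantile=0.95, alpha=0.1):
    """Leave-one-out calibration: score each clean example against
    statistics fitted on all the others, then take a quantile."""
    n = clean_deltas.shape[0]
    scores = np.empty(n)
    for i in range(n):
        rest = np.delete(clean_deltas, i, axis=0)
        scores[i] = score(clean_deltas[i:i + 1], fit_stats(rest, alpha))[0]
    return np.quantile(scores, quantile)

# Synthetic stand-in for 200 clean examples, 6 layer transitions, 16 dims.
rng = np.random.default_rng(0)
clean = hidden_to_deltas(rng.normal(size=(200, 7, 16)).cumsum(axis=1))
stats = fit_stats(clean)
tau = loo_threshold(clean)

# A hypothetical anomaly: one layer transition is shifted in every dimension.
suspect = hidden_to_deltas(rng.normal(size=(20, 7, 16)).cumsum(axis=1))
suspect[:, 3, :] += 3.0
flags = score(suspect, stats) > tau
print(f"threshold={tau:.2f}, flagged {flags.sum()} of {flags.size}")
```

In a real deployment the `hidden` array would come from the served model's per-layer activations (for example, at the last prompt token), and the quantile would be chosen to trade detection rate against the false-positive rate the abstract reports.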
Source: arXiv: 2604.24542