
arXiv submission date: 2026-02-11
📄 Abstract - Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks

Machine learning models are increasingly present in our everyday lives; as a result, they become targets of adversarial attackers seeking to manipulate the systems we interact with. A well-known vulnerability is a backdoor introduced into a neural network by poisoned training data or a malicious training process. Backdoors can be used to induce unwanted behavior by including a certain trigger in the input. Existing mitigations filter training data, modify the model, or perform expensive input modifications on samples. If a vulnerable model has already been deployed, however, those strategies are either ineffective or inefficient. To address this gap, we propose our inference-time backdoor mitigation approach called FIRE (Feature-space Inference-time REpair). We hypothesize that a trigger induces structured and repeatable changes in the model's internal representation. We view the trigger as directions in the latent spaces between layers that can be applied in reverse to correct the inference mechanism. Therefore, we turn the backdoored model against itself by manipulating its latent representations and moving a poisoned sample's features along the backdoor directions to neutralize the trigger. Our evaluation shows that FIRE has low computational overhead and outperforms current runtime mitigations on image benchmarks across various attacks, datasets, and network architectures.
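To make the core idea concrete, below is a minimal, hypothetical sketch (not the paper's reference implementation) of how a latent-space correction like the one the abstract describes could be wired into a PyTorch model: a forward hook on an intermediate layer projects each sample's features onto an estimated backdoor direction and shifts them back along that direction. The layer choice, the way `backdoor_direction` is estimated, and the scaling factor `alpha` are illustrative assumptions, not details taken from the paper.

```python
# Sketch: counteract a backdoor trigger by moving latent features
# against an estimated "backdoor direction" at inference time.
import torch
import torch.nn as nn


def attach_fire_style_hook(model: nn.Module,
                           layer: nn.Module,
                           backdoor_direction: torch.Tensor,
                           alpha: float = 1.0):
    """Register a forward hook that shifts features against the trigger direction.

    `backdoor_direction` is a (D,)-shaped vector in the flattened feature space
    of `layer`; how it is estimated is outside the scope of this sketch.
    """
    direction = backdoor_direction / backdoor_direction.norm()  # unit vector d

    def hook(_module, _inputs, output):
        # Flatten spatial dims so each sample is a single feature vector (B, D).
        feats = output.flatten(start_dim=1)
        # Component of each feature along the estimated backdoor direction.
        coeff = feats @ direction                                # shape (B,)
        # Move the features in the opposite direction to neutralize the trigger.
        corrected = feats - alpha * coeff.unsqueeze(1) * direction
        return corrected.view_as(output)

    return layer.register_forward_hook(hook)


# Usage sketch: `model`, `penultimate_layer`, and `estimated_direction`
# are placeholders the defender would supply.
# handle = attach_fire_style_hook(model, penultimate_layer, estimated_direction)
# logits = model(images)   # inference now runs on corrected latent features
# handle.remove()          # detach the hook when no longer needed
```

Because the correction is applied only to intermediate activations during the forward pass, the deployed model's weights stay untouched, which matches the inference-time setting the abstract targets.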

Top-level tags: machine learning, model evaluation, systems
Detailed tags: backdoor mitigation, adversarial robustness, latent space, inference-time defense, neural network security

Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks


1️⃣ One-sentence summary

This paper proposes a runtime defense method called FIRE, which analyzes and reverses the specific changes that a backdoor trigger induces in a neural network's internal feature space, enabling efficient defense against a variety of backdoor attacks without modifying the already-deployed model.

Source: arXiv: 2602.10780