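Harnessing non-adversarial robustness in large language models
1️⃣ One-sentence summary
This paper proposes a method that avoids retraining the entire model, using a simple debiasing fine-tuning process to keep large language models stable under prompt variations that are semantically similar but worded differently (such as word substitutions or input noise), and it theoretically analyzes a key factor affecting robustness: a systematic bias shift in neural network modules.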
This work presents an approach to making Large Language Models (LLMs) robust to alterations and potential errors introduced by semantically similar but textually different prompts. Recent work has shown that such prompt variations can significantly affect LLM performance on tasks. The central question is: can robustness to semantically neutral prompt alterations be acquired without expensive retraining of the entire model? We address this question both theoretically and experimentally. Our theoretical analysis reveals a crucial factor affecting model robustness: a systematic expected shift, or perturbation-induced bias, in neural network module outputs. Motivated by this analysis, we show that robustness can be achieved via a simple fine-tuning process, debiasing for robustness. We identify the conditions under which debiasing helps and those under which it does not, and demonstrate, through both theory and extensive experiments, that debiasing for robustness can be a quick and efficient tool for enhancing robustness and providing certification against random prompt perturbations.
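To make the idea of a perturbation-induced bias concrete, the following is a minimal toy sketch, not the paper's method. It assumes a single linear-plus-ReLU layer as a stand-in for a network module, zero-mean Gaussian input noise as a stand-in for random prompt perturbations, and a constant output correction in place of the fine-tuning procedure the abstract describes. All names (`module`, `estimate_bias`, `sigma`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def module(x, W, b):
    """Toy stand-in for a network module: linear layer followed by ReLU."""
    return np.maximum(W @ x + b, 0.0)

def estimate_bias(xs, W, b, sigma, n_samples, rng):
    """Monte Carlo estimate of the mean output shift E[f(x + delta) - f(x)],
    averaged over the inputs in xs, for delta ~ N(0, sigma^2 I)."""
    shifts = []
    for x in xs:
        noise = rng.normal(0.0, sigma, size=(n_samples, x.shape[0]))
        perturbed = np.maximum((x + noise) @ W.T + b, 0.0)
        shifts.append(perturbed.mean(axis=0) - module(x, W, b))
    return np.mean(shifts, axis=0)

rng = np.random.default_rng(0)
d_in, d_out, sigma = 16, 8, 0.5

# Hypothetical module parameters and input samples (assumptions, not real data).
W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
b = np.zeros(d_out)
calibration = rng.normal(size=(200, d_in))
held_out = rng.normal(size=(200, d_in))

# Systematic shift measured on calibration inputs.
bias = estimate_bias(calibration, W, b, sigma, 256, rng)

# Mean shift on unseen inputs, before and after subtracting the estimate.
held_out_shift = estimate_bias(held_out, W, b, sigma, 256, rng)
raw_norm = np.linalg.norm(held_out_shift)
residual_norm = np.linalg.norm(held_out_shift - bias)

print("mean shift norm without correction:", raw_norm)
print("mean shift norm with correction:   ", residual_norm)
```

In this toy setting the noise has zero mean, yet the module output still shifts on average because ReLU is convex, so the shift is systematic rather than random. Subtracting a shift estimated on separate calibration inputs removes most of it on held-out inputs. Whether anything similar holds for a full LLM, and under which conditions, is what the paper's theory and experiments address.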
Source: arXiv: 2605.29816