arXiv submission date: 2026-05-04
📄 Abstract - Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training

Machine-generated text (MGT) detection is critical for regulating online information ecosystems, yet existing detectors often underperform in few-shot settings and remain vulnerable to adversarial, humanizing attacks. To build accurate and robust detectors under limited supervision, we adopt a threat-modeling perspective and study detector vulnerabilities from an attacker's viewpoint under an output-only black-box setting. Motivated by this perspective, we propose RAG-GuidEd Attacker Strengthens ConTrastive Few-shot Detector (REACT), an adversarial training framework that improves both few-shot detection performance and robustness against attacks. REACT couples a humanization-oriented attacker with a target detector: the attacker leverages retrieval-augmented generation (RAG) to craft highly human-like adversarial examples to evade detection, while the detector learns from these adversaries with a contrastive objective to stabilize few-shot representation learning and enhance robustness. We alternately update the attacker and the detector to enable their co-evolution. Experiments on 4 datasets with 4 shot sizes and 3 random seeds show that REACT improves average detection F1 by 4.95 points over 8 state-of-the-art (SOTA) detectors and reduces the average attack success rate (ASR) under 4 strong attacks by 3.66 percentage points.
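The alternating attacker/detector loop the abstract describes can be illustrated with a deliberately simplified sketch. Everything below is hypothetical: a 1-D "stylistic score" stands in for real text features, a bounded shift stands in for RAG-guided humanization, and plain logistic regression stands in for the contrastive detector. It shows only the co-evolution structure, not the actual REACT implementation.

```python
import math
import random

random.seed(0)

# Toy 1-D "texts": a single stylistic score stands in for real text features
# (purely illustrative -- REACT itself operates on actual text).
human = [random.gauss(0.0, 0.3) for _ in range(20)]    # label 0
machine = [random.gauss(2.0, 0.3) for _ in range(20)]  # label 1

w, b = 1.0, 0.0  # detector parameters: P(machine) = sigmoid(w*x + b)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def detect(x):
    return sigmoid(w * x + b) > 0.5

def humanize(x, step=0.3, budget=3):
    # Attacker step: nudge a machine sample toward the human region until it
    # evades the current detector, under a bounded edit budget (a crude
    # stand-in for meaning-preserving, RAG-guided humanization).
    for _ in range(budget):
        if not detect(x):
            break
        x -= step
    return x

def train_detector(pos, neg, lr=0.1, epochs=100):
    # Plain logistic-regression SGD; REACT's actual detector trains with a
    # contrastive objective, omitted here for brevity.
    global w, b
    data = [(x, 1.0) for x in pos] + [(x, 0.0) for x in neg]
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x
            b -= lr * (p - y)

# Alternating co-evolution: the attacker humanizes against the current
# detector, then the detector retrains on the resulting adversaries.
for _ in range(3):
    adversarial = [humanize(x) for x in machine]
    train_detector(machine + adversarial, human)
```

The attacker's edit budget mirrors the constraint that humanizing attacks must preserve the text's content; because the adversaries cannot cross fully into the human region, retraining on them tightens the boundary instead of collapsing it.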

Top-level tags: llm, natural language processing, model training
Detailed tags: machine-generated text, few-shot detection, adversarial training, robustness, contrastive learning

Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training


1️⃣ One-sentence summary

This paper proposes REACT, an adversarial training framework in which an attacker uses retrieval-augmented generation to craft highly human-like machine-text traps, and the detector learns by competing against these traps, substantially improving both the accuracy and the attack resistance of machine-generated text detection when only a few training samples are available.

Source: arXiv 2605.02374