arXiv submission date: 2026-03-16
📄 Abstract - ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations

Toxicity detection mitigates the dissemination of toxic content (e.g., hateful comments, posts, and messages in online social interactions) to safeguard a healthy online social environment. However, malicious users persistently develop evasive perturbations to disguise toxic content and evade detectors. Traditional detectors and methods are static over time and thus inadequate for addressing these evolving evasion tactics. Continual learning therefore emerges as a natural approach to dynamically update detection ability against evolving perturbations. Nevertheless, disparities across perturbations hinder the detector's continual learning on perturbed text. More importantly, perturbation-induced noise distorts semantics, degrading comprehension, and also impairs critical feature learning, making detection sensitive to perturbations. These issues amplify the challenge of continual learning against evolving perturbations. In this work, we present ContiGuard, the first framework tailored for continual learning of the detector on time-evolving perturbed text (termed continual toxicity detection), enabling the detector to continually update its capability and maintain sustained resilience against evolving perturbations. Specifically, to boost comprehension, we present an LLM-powered semantic enriching strategy that dynamically incorporates possible meanings and toxicity-related clues excavated by an LLM into the perturbed text. To mitigate non-critical features and amplify critical ones, we propose a discriminability-driven feature learning strategy, in which we strengthen discriminative features while suppressing less-discriminative ones to shape a robust classification boundary for detection...
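The two strategies named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `[SEP]` concatenation format, the `llm_hint` callback (a stand-in for a real LLM query), and the Fisher-style discriminability score are all assumptions introduced here for clarity.

```python
import statistics

def enrich_text(perturbed, llm_hint):
    """LLM-powered semantic enriching (sketch): append the LLM-excavated
    possible meaning and toxicity-related clues to the perturbed input,
    so the downstream detector sees a semantically richer text.
    The concatenation format is a hypothetical choice."""
    clues = llm_hint(perturbed)  # in practice, a prompt to a real LLM
    return f"{perturbed} [SEP] {clues}"

def discriminability_weights(toxic_feats, benign_feats, eps=1e-8):
    """Discriminability-driven reweighting (sketch): score each feature
    dimension by a Fisher-style ratio (class-mean separation over
    within-class spread) and return per-dimension weights that amplify
    discriminative dimensions while suppressing the rest."""
    dims = len(toxic_feats[0])
    weights = []
    for d in range(dims):
        t = [x[d] for x in toxic_feats]
        b = [x[d] for x in benign_feats]
        separation = abs(statistics.mean(t) - statistics.mean(b))
        spread = statistics.pstdev(t) + statistics.pstdev(b) + eps
        weights.append(separation / spread)
    return weights
```

As a usage sketch, `enrich_text("1d1ot", fake_llm)` would yield the perturbed token followed by the recovered meaning, while `discriminability_weights` applied to toxic vs. benign feature vectors returns near-zero weights on dimensions the two classes share and large weights on dimensions that separate them.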

Top-level tags: llm, natural language processing, model training
Detailed tags: continual learning, toxicity detection, adversarial robustness, semantic enrichment, feature learning

ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations


1️⃣ One-sentence summary

This paper proposes a new framework called ContiGuard, which uses a large language model to enrich semantic understanding and optimize feature learning, enabling online toxicity detection systems to continually learn and dynamically update, and thereby effectively counter the ever-changing text evasion tactics of malicious users.

Source: arXiv:2603.14843