arXiv submission date: 2026-02-25
📄 Abstract - Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive information leaks without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained models revert to destructive truncation (deleting text). Finally, we identify a Reasoning Paradox: while inference-time reasoning increases baseline risk by enabling the model to make deeper sensitive inferences, it simultaneously empowers the defense to execute safe rewrites.
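
The critique-and-rewrite loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify prompts, models, or stopping criteria, so everything here is an assumption. In particular, `TOY_REWRITES`, `critique`, `rewrite`, `edit_loop`, and `max_rounds` are hypothetical names, and a fixed lookup table stands in for the LLM "Editor" so that the example runs on its own.

```python
from typing import Dict, List

# Hypothetical lookup table standing in for an LLM critic and rewriter.
# A real system would query a model instead of matching fixed strings.
TOY_REWRITES: Dict[str, str] = {
    "is a fraud": "has faced public criticism",
    "must be seriously ill": "has not disclosed any health details",
}


def critique(text: str) -> List[str]:
    """Return the sensitive spans still present in the text (toy critic)."""
    return [span for span in TOY_REWRITES if span in text]


def rewrite(text: str, span: str) -> str:
    """Replace one flagged span with softer phrasing (toy rewriter)."""
    return text.replace(span, TOY_REWRITES[span])


def edit_loop(draft: str, max_rounds: int = 3) -> str:
    """Alternate critique and rewrite until nothing is flagged
    or the round budget (an assumed parameter) is exhausted."""
    text = draft
    for _ in range(max_rounds):
        flagged = critique(text)
        if not flagged:
            break
        for span in flagged:
            text = rewrite(text, span)
    return text


draft = "The mayor is a fraud and must be seriously ill."
edited = edit_loop(draft)
print(edited)
```

The point of the sketch is the control flow: the draft is rewritten span by span instead of being replaced with a refusal, so the rest of the sentence survives. Text with no flagged spans passes through unchanged.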

Top-level tags: llm agents natural language processing
Detailed tags: self-correction privacy semantic sensitive information inference-time defense agentic editing

Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information


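1️⃣ One-sentence summary

This paper proposes SemSIEdit, a new method that lets a large language model act like an "editor" when answering: it actively identifies and safely rewrites sensitive content that could leak personal privacy or cause reputational harm, rather than simply refusing to answer, so that privacy is protected effectively while as much of the answer's usefulness as possible is preserved.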

Source: arXiv: 2602.21496