SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

📄 Abstract

Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.
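To make the waiting-buffer idea concrete, here is a minimal sketch of how streamed tokens might be grouped into sentence chunks and released only after verification. It is an illustration under stated assumptions, not the authors' implementation: the function name `stream_with_guard`, the `is_safe` callback, and the punctuation-based sentence boundary are all hypothetical, and the guard is called sequentially here, whereas the paper describes it running in parallel with decoding.

```python
import re
from typing import Callable, Iterable, Iterator

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# The paper does not specify its boundary rule; this is only a stand-in.
_SENTENCE_END = re.compile(r"[.!?]\s")


def stream_with_guard(
    tokens: Iterable[str],
    is_safe: Callable[[str], bool],
) -> Iterator[str]:
    """Yield verified sentence chunks; stop at the first unsafe prefix.

    `is_safe` is a hypothetical guard that receives the whole prefix
    (all released text plus the candidate sentence) and returns a bool.
    """
    buffer = ""  # tokens that do not yet form a complete sentence
    prefix = ""  # text that has already been verified and released

    for token in tokens:
        buffer += token
        # A single token may complete more than one sentence.
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.end()]
            buffer = buffer[match.end():]
            if not is_safe(prefix + sentence):
                return  # the unsafe sentence is never released
            prefix += sentence
            yield sentence

    # Flush the trailing chunk once generation has finished.
    if buffer and is_safe(prefix + buffer):
        yield buffer


if __name__ == "__main__":
    # Toy guard: flags any prefix that contains the marker "UNSAFE".
    demo_tokens = ["Hi", " there", ".", " This", " is", " UNSAFE", ".", " More", "."]
    for chunk in stream_with_guard(demo_tokens, lambda text: "UNSAFE" not in text):
        print(repr(chunk))
```

In this sketch the user only ever sees text that the guard has accepted, so the cost of moderation is a delay of roughly one sentence rather than the full response. A real deployment would replace the toy guard with a trained classifier and overlap the guard call with generation of the next sentence.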


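1️⃣ One-Sentence Summary

SentGuard proposes a sentence-level safety monitoring method that checks each sentence in real time as a large language model streams its output. This avoids the lag of intervening only after the full response has been generated, and it also avoids the misjudgments that token-by-token checking makes on incomplete semantics. In experiments it detects 90.5% of unsafe cases within two sentences while keeping the streaming false-positive rate at 7.41%.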
