arXiv submission date: 2026-01-26
📄 Abstract - AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To build an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and of seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across the Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.
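The abstract describes a taxonomy that labels each agentic risk along three orthogonal axes: source (where), failure mode (how), and consequence (what). A minimal sketch of such a label as a data structure is shown below; the concrete category values are invented placeholders for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical categories along each axis; the paper's actual
# taxonomy values are not reproduced here.
class Source(Enum):          # "where" the risk originates
    USER_INPUT = "user_input"
    TOOL_CALL = "tool_call"
    ENVIRONMENT = "environment"

class FailureMode(Enum):     # "how" the agent fails
    UNSAFE_ACTION = "unsafe_action"
    UNREASONABLE_ACTION = "unreasonable_action"

class Consequence(Enum):     # "what" harm results
    PRIVACY_LEAK = "privacy_leak"
    FINANCIAL_LOSS = "financial_loss"
    SYSTEM_DAMAGE = "system_damage"

@dataclass(frozen=True)
class RiskLabel:
    """One point in the 3-D taxonomy: (where, how, what)."""
    source: Source
    failure_mode: FailureMode
    consequence: Consequence

# Example: an unsafe tool call that leaks private data
label = RiskLabel(Source.TOOL_CALL,
                  FailureMode.UNSAFE_ACTION,
                  Consequence.PRIVACY_LEAK)
print(label.source.value, label.failure_mode.value, label.consequence.value)
```

Because the axes are orthogonal, any combination of the three values is a valid label, which is what lets a diagnostic guardrail report *where* and *how* a trajectory went wrong rather than only a binary safe/unsafe verdict.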

Top-level tags: agents, model evaluation, systems
Detailed tags: agent safety, risk diagnosis, guardrail framework, benchmark, transparency

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security


1️⃣ One-Sentence Summary

This paper proposes an intelligent diagnostic framework called AgentDoG. Like a traffic guardrail, it prevents an AI agent from taking dangerous actions while carrying out tasks; and like a car diagnostic scanner, it digs into the specific root causes of those dangerous behaviors, thereby improving agent safety more effectively.

Source: arXiv:2601.18491