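Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents
1️⃣ One-sentence summary
This paper proposes the ClawdGo framework, in which an AI agent trains through self-play at inference time by alternating the roles of attacker, defender, and evaluator. Without modifying the model, this substantially improves the agent's ability to recognise and respond to security threats such as prompt injection and memory poisoning. The paper also identifies a new problem: over-training can cause the agent to flag legitimate requests as attacks.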
Autonomous AI agents deployed on platforms such as OpenClaw face prompt injection, memory poisoning, supply-chain attacks, and social engineering, yet existing defences address only the platform perimeter, leaving the agent's own threat judgement entirely untrained. We present ClawdGo, a framework for endogenous security awareness training: we teach the agent to recognise and reason about threats from the inside, at inference time, with no model modification. Four contributions are introduced: TLDT (Three-Layer Domain Taxonomy) organises 12 trainable dimensions across Self-Defence, Owner-Protection, and Enterprise-Security layers; ASAT (Autonomous Security Awareness Training) is a self-play loop where the agent alternates attacker, defender, and evaluator roles under weakest-first curriculum scheduling; CSMA (Cross-Session Memory Accumulation) compounds skill gains via a four-layer persistent memory architecture and Axiom Crystallisation Promotion (ACP); and SACP (Security Awareness Calibration Problem) formalises the precision-recall tradeoff introduced by endogenous training. Live experiments show weakest-first ASAT raises average TLDT score from 80.9 to 96.9 over 16 sessions, outperforming uniform-random scheduling by 6.5 points and covering 11 of 12 dimensions. CSMA retains the full gain across sessions; cold-start ablation recovers only 2.4 points, leaving a 13.6-point gap. E-mode generates 32 TLDT-conformant scenarios covering all 12 dimensions. SACP is observed when a heavily trained agent classifies a legitimate capability assessment as prompt injection (30/160).
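The weakest-first curriculum scheduling mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dimension names, the starting scores, and the update rule (each session closes a fixed fraction `gain` of the gap to a cap of 100) are all assumptions made for illustration. The sketch only shows the scheduling idea, which is to train next on whichever dimension currently scores lowest, and contrasts it with a uniform-random picker.

```python
import random
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]


def weakest_first(scores: Scores) -> str:
    """Pick the dimension with the lowest score (ties broken alphabetically)."""
    return min(sorted(scores), key=lambda dim: scores[dim])


def make_uniform_random(seed: int) -> Callable[[Scores], str]:
    """Build a baseline picker that ignores scores and samples uniformly."""
    rng = random.Random(seed)
    return lambda scores: rng.choice(sorted(scores))


def run_sessions(
    scores: Scores,
    pick: Callable[[Scores], str],
    n_sessions: int,
    gain: float = 0.5,   # assumed: fraction of the remaining gap closed per session
    cap: float = 100.0,  # assumed: maximum attainable score
) -> Tuple[Scores, List[str]]:
    """Simulate training sessions; return final scores and the pick history."""
    current = dict(scores)  # copy so the caller's dict is left untouched
    history: List[str] = []
    for _ in range(n_sessions):
        dim = pick(current)
        current[dim] = min(cap, current[dim] + gain * (cap - current[dim]))
        history.append(dim)
    return current, history


if __name__ == "__main__":
    # Hypothetical dimensions and scores, not taken from the paper.
    start = {"dim_a": 90.0, "dim_b": 60.0, "dim_c": 75.0, "dim_d": 85.0}
    for name, picker in [
        ("weakest-first", weakest_first),
        ("uniform-random", make_uniform_random(seed=0)),
    ]:
        final, history = run_sessions(start, picker, n_sessions=4)
        average = sum(final.values()) / len(final)
        print(f"{name}: picks={history} average={average:.2f}")
```

Under this toy update rule, weakest-first always spends a session where the remaining gap is largest, which is one plausible reason such a schedule could outperform uniform sampling. The reported figures are also internally consistent: a rise from 80.9 to 96.9 is a 16.0-point gain, and a cold start that recovers 2.4 points leaves the stated 13.6-point gap.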
Source: arXiv: 2604.24020