TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment
1️⃣ One-Sentence Summary
This paper proposes TriPlay-RL, a reinforcement learning framework in which three roles (an attacker, a defender, and an evaluator) automatically play against and co-evolve with one another in a closed loop, markedly improving the LLM's safety defenses, attack diversity, and evaluation accuracy without manual annotation.
In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker that generates adversarial prompts, a defender that produces safe responses, and an evaluator that assesses those responses. In this paper, we propose TriPlay-RL, a closed-loop reinforcement learning framework that enables iterative, co-improving collaboration among the three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading its general reasoning capability; and the evaluator continuously refines its fine-grained judgment across iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
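The abstract describes the closed loop only in prose. The sketch below illustrates one plausible shape of a single tri-role iteration; all names here (`Role`, `evaluator_score`, the reward wiring) are hypothetical stand-ins, assuming a standard policy-gradient-style update and a zero-sum-flavored reward split between attacker and defender, which the paper may implement differently.

```python
"""Minimal sketch of a TriPlay-RL-style closed loop.

All classes and functions are hypothetical illustrations; the paper
does not publish this API, and the real roles would be LLM policies
trained with a policy-gradient method (e.g., PPO), not toy objects.
"""

from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:
    """Placeholder for a policy model updated by RL."""
    name: str
    history: List[float] = field(default_factory=list)

    def generate(self, prompt: str) -> str:
        # In the real system this would sample from an LLM policy.
        return f"{self.name}-output({prompt})"

    def update(self, reward: float) -> None:
        # Stand-in for a PPO-style policy update on the reward signal.
        self.history.append(reward)


def evaluator_score(response: str) -> float:
    """Hypothetical fine-grained judge: distinguishes unsafe responses,
    simple refusals, and genuinely useful safe guidance."""
    if "unsafe" in response:
        return 0.0   # harmful content
    if "refuse" in response:
        return 0.5   # safe but unhelpful refusal
    return 1.0       # safe and useful guidance


def triplay_iteration(attacker: Role, defender: Role, seed_prompt: str) -> None:
    """One closed-loop round: attack -> defend -> evaluate -> update."""
    adv_prompt = attacker.generate(seed_prompt)   # attacker crafts an adversarial prompt
    response = defender.generate(adv_prompt)      # defender answers it
    safety = evaluator_score(response)            # evaluator judges the response

    defender.update(safety)            # defender rewarded for safe, useful answers
    attacker.update(1.0 - safety)      # attacker rewarded when the defense fails
    # The evaluator itself is also refined over iterations (e.g., on
    # disagreement cases), closing the three-way co-evolution loop.


if __name__ == "__main__":
    attacker, defender = Role("attacker"), Role("defender")
    for step in range(3):
        triplay_iteration(attacker, defender, seed_prompt=f"seed-{step}")
    print("attacker rewards:", attacker.history)
    print("defender rewards:", defender.history)
```

The zero-sum split (`1.0 - safety` for the attacker) is one simple way to make the roles adversarial; how the evaluator's own training signal is derived across iterations is only noted as a comment here, since the abstract does not spell it out.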
Source: arXiv 2601.18292