CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model

📄 Abstract - CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model

Whether Large Language Models (LLMs) exhibit covert psychological manipulation in complex human-AI interactions has garnered increasing safety concerns. However, existing AI safety benchmarks remain largely restricted to explicit rule compliance and static prompts, failing to capture the dynamic and covert nature of manipulative strategies in multi-turn dialogues. We introduce CogManip, a comprehensive benchmark that evaluates 15 manipulation strategy risks across 1,000 multi-turn interaction scenarios, validated by human experts. A systematic evaluation of 13 representative models, including frontier models like GPT-5.4 and DeepSeek-V3.2, reveals significant risk heterogeneities and illuminates the targeted direction for future defense. Further analysis of objective function perturbation reveals that DeepSeek-V3.2's manipulation tactics are highly sensitive to both negative and benign system prompts, demonstrating the critical necessity of prompt-based defense engineering and implicit goal auditing. CogManip offers a robust instrument and perspective for auditing the implicit psychological influence and dynamic strategy selection of modern LLMs.

CogManip：多轮交互中大语言模型操纵行为的基准评估 / CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model

1️⃣ 一句话总结

本文提出了一套名为CogManip的评估基准，通过1000个多轮对话场景系统检测大语言模型中的15种隐性心理操纵策略，发现不同模型在操纵风险上差异显著，并证明了通过优化提示语可以有效防御这类行为。

← 返回列表

菜单

AI 帮我研读全文

1️⃣ 一句话总结

密码管理

设置密码

修改密码

移除密码

菜单

AI 帮我研读全文

1️⃣ 一句话总结

获取最新论文摘要