Abstract - MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models
Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging due to the interactional and context-dependent nature of clinical harm. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, limiting their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts — perpetrator, instigator, facilitator, and enabler — combined with clinically grounded harm categories. We then propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.
MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models
1️⃣ One-Sentence Summary
This paper proposes a new evaluation framework, MHSafeEval, which systematically discovers and diagnoses the safety risks of large language models in mental health applications by simulating multi-turn dialogues and analyzing four harmful roles an AI counselor may adopt, addressing the limitations of existing static evaluation methods.