📄
Abstract - Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation
Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield only moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen's Kappa ($\kappa$) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As a proof of concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results show that Gemini achieves the highest reliability ($\kappa = 0.907$, cosine = 95.3%), followed by GPT-4o ($\kappa = 0.853$, cosine = 92.6%) and Claude ($\kappa = 0.842$, cosine = 92.1%). All three models achieve high agreement ($\kappa > 0.80$), validating the multi-run ensemble approach. The framework successfully extracts consensus themes across runs, with Gemini identifying 6 consensus themes (50-83% run consistency), GPT-4o identifying 5, and Claude 4. Our open-source implementation provides researchers with transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction, establishing methodological foundations for reliable AI-assisted qualitative research.
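To make the dual-metric idea concrete, here is a minimal sketch of how the two scores could be computed for one pair of analysis runs: Cohen's kappa corrects observed agreement $p_o$ for chance agreement $p_e$, $\kappa = (p_o - p_e)/(1 - p_e)$, while semantic consistency averages best-match cosine similarity between embedded theme descriptions. The library choices (`scikit-learn`, `sentence-transformers`), function names, and embedding model are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the dual reliability metrics; library choices
# (scikit-learn, sentence-transformers) are assumptions, not the
# paper's published implementation.
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def dual_reliability(codes_a, codes_b, themes_a, themes_b):
    """Cohen's kappa over parallel code assignments, plus mean
    best-match cosine similarity over embedded theme descriptions."""
    # Inter-rater agreement: both runs code the same transcript
    # segments, so codes_a and codes_b are parallel label lists.
    kappa = cohen_kappa_score(codes_a, codes_b)

    # Semantic consistency: embed free-text themes and pair each
    # theme in run A with its most similar theme in run B.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    sims = cosine_similarity(model.encode(themes_a), model.encode(themes_b))
    semantic = sims.max(axis=1).mean()  # best-match similarity, averaged

    return kappa, semantic

# Toy example: two runs coding five segments of one transcript.
kappa, semantic = dual_reliability(
    codes_a=["healing", "trust", "healing", "insight", "trust"],
    codes_b=["healing", "trust", "insight", "insight", "trust"],
    themes_a=["Emotional healing through art", "Trust in the therapist"],
    themes_b=["Art as a vehicle for emotional healing", "Therapeutic trust"],
)
print(f"kappa = {kappa:.3f}, cosine = {semantic:.3f}")
```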
Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation
1️⃣ One-Sentence Summary
This paper proposes a new framework that combines Cohen's Kappa and semantic similarity as dual reliability metrics to evaluate and validate the reliability of multiple large language models in qualitative thematic analysis, and demonstrates experimentally that the approach can effectively improve the credibility of AI-assisted research.
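The abstract's structure-agnostic consensus extraction can be pictured as cross-run theme matching. Below is a hedged sketch under assumed details (the 0.8 match threshold, the 50% consistency floor, and the embedding model are guesses, not the paper's published parameters): themes from each independent run are greedily merged into clusters by embedding similarity, and clusters that appear in at least half the runs are reported, which for six runs would produce consistency values in the 50-83% range quoted for Gemini.

```python
# Illustrative sketch of cross-run consensus theme extraction;
# thresholds and function names are assumptions, not the paper's code.
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def consensus_themes(runs, match_threshold=0.8, min_consistency=0.5):
    """Merge semantically similar themes across independent runs and
    keep clusters present in at least `min_consistency` of all runs."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    clusters = []  # each: {"label": str, "embedding": vector, "runs": set}
    for run_idx, themes in enumerate(runs):
        for theme in themes:
            emb = model.encode([theme])[0]
            # Attach the theme to the closest existing cluster, if any
            # cluster is at least match_threshold similar.
            best, best_sim = None, match_threshold
            for cluster in clusters:
                sim = cosine_similarity([emb], [cluster["embedding"]])[0, 0]
                if sim >= best_sim:
                    best, best_sim = cluster, sim
            if best is None:
                clusters.append({"label": theme, "embedding": emb, "runs": {run_idx}})
            else:
                best["runs"].add(run_idx)
    # Report each surviving cluster with its run-consistency fraction.
    return [
        (c["label"], len(c["runs"]) / len(runs))
        for c in clusters
        if len(c["runs"]) / len(runs) >= min_consistency
    ]
```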