
arXiv submission date: 2025-12-22
📄 Abstract - DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

Drama script continuation requires models to maintain character consistency, advance plot coherently, and preserve dramatic structure, capabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rule-based analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct a comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations in total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm that all six dimensions capture independent quality aspects (mean |r| = 0.020). DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.
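The ablation claim that the six dimensions are independent (mean |r| = 0.020) can be sketched as a correlation check over per-dimension scores. This is an illustrative sketch, not the paper's code: the score matrix below is random stand-in data, and the dimension names are shortened labels for the six dimensions listed in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical score matrix: one row per evaluated script,
# one column per evaluation dimension (stand-in random data).
dimensions = ["format", "efficiency", "character", "emotion", "logic", "conflict"]
scores = rng.random((1103, len(dimensions)))

# 6x6 Pearson correlation between dimension scores.
corr = np.corrcoef(scores, rowvar=False)

# Mean absolute off-diagonal correlation: near 0 means the
# dimensions capture independent quality aspects.
off_diag = corr[~np.eye(len(dimensions), dtype=bool)]
mean_abs_r = np.abs(off_diag).mean()
print(f"mean |r| = {mean_abs_r:.3f}")
```

With truly independent columns of this sample size, the mean |r| lands close to the paper's reported 0.020; strongly overlapping dimensions would push it well above that.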

Top-level tags: llm benchmark natural language processing
Detailed tags: script generation evaluation framework multidimensional evaluation creative writing llm as judge

DramaBench: A Large-Scale Benchmark for Drama Script Continuation Evaluation / DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation


1️⃣ One-Sentence Summary

This paper introduces DramaBench, the first large-scale, multi-dimensional evaluation benchmark for the drama script continuation task. It combines rule-based analysis with LLM-based labeling and statistical metrics to provide objective, reproducible evaluation and actionable feedback for model improvement.


2️⃣ Key Contributions

1. The DramaBench benchmark and dataset

2. A six-dimensional hybrid evaluation framework

3. A scene-boundary-aware segmentation algorithm

4. A rigorous validation methodology

5. Systematic error analysis and case studies
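Among the contributions, the paper names a scene-boundary-aware segmentation algorithm but this summary gives no details. The sketch below only illustrates the general idea of cutting a screenplay at scene headings rather than mid-scene; the `INT.`/`EXT.` heading convention and the function name `split_scenes` are assumptions, not the paper's method.

```python
import re

# Scene headings in standard screenplay format start with INT. or EXT.
SCENE_HEADING = re.compile(r"^(INT\.|EXT\.)", re.MULTILINE)

def split_scenes(script: str) -> list[str]:
    """Split a screenplay into scenes at INT./EXT. headings,
    so that continuation prompts never cut a scene in half."""
    starts = [m.start() for m in SCENE_HEADING.finditer(script)]
    if not starts:
        return [script]
    bounds = starts + [len(script)]
    return [script[bounds[i]:bounds[i + 1]].strip() for i in range(len(starts))]

script = "INT. KITCHEN - NIGHT\nAnna stirs the pot.\nEXT. STREET - DAY\nA bus passes."
scenes = split_scenes(script)
print(len(scenes))  # 2
```

A production splitter would also need to handle non-standard headings and dialogue blocks, which is presumably where the paper's boundary-awareness matters.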


3️⃣ Main Results and Value

Result highlights: 8 state-of-the-art language models were evaluated on 1,103 scripts (8,824 evaluations in total); 65.9% of 252 pairwise comparisons were statistically significant; human validation on 188 scripts showed substantial agreement on 3 of 5 dimensions; ablations confirm the six dimensions capture independent quality aspects (mean |r| = 0.020).

Practical value: DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.


4️⃣ Glossary

Source: arXiv:2512.19012