HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

📄 Abstract - HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge because writing skill is multidimensional and existing metrics are limited. LLM performance on open-ended writing at the thousand-word scale is inadequately assessed by both traditional reference-based metrics and modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency that often arises when an LLM-as-a-judge aggregates all sub-features during text evaluation. ToW uses a tree-structured workflow that explicitly models the aggregation weights of sub-features. We also present HoWToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW mitigates the resulting biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we find that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, suggesting that these scores cannot be improved simply by piling more information into the input.
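The idea of explicitly weighted, tree-structured aggregation can be illustrated with a minimal sketch. This is not the paper's implementation: the `Node` class, the `aggregate` function, the feature names, the weights, and the leaf scores are all hypothetical, chosen only to show how leaf-level sub-feature scores could be combined bottom-up with explicit weights rather than left to a judge model's implicit weighting.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One sub-feature in a hypothetical evaluation tree."""
    name: str
    weight: float = 1.0            # explicit weight relative to siblings
    score: Optional[float] = None  # set on leaves only (e.g. a judge's rating)
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Return the weighted average of a node's children, recursively."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf '{node.name}' has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of '{node.name}' have no positive weight")
    weighted_sum = sum(child.weight * aggregate(child) for child in node.children)
    return weighted_sum / total_weight


# Illustrative tree; names, weights, and scores are made up for this example.
tree = Node("overall", children=[
    Node("content", weight=0.6, children=[
        Node("relevance", weight=0.5, score=8.0),
        Node("depth", weight=0.5, score=6.0),
    ]),
    Node("language", weight=0.4, children=[
        Node("fluency", weight=1.0, score=9.0),
    ]),
])

print(round(aggregate(tree), 2))  # 7.8
```

Because the weights are stored in the tree, the same leaf scores always yield the same overall score, which is the kind of consistency that a single holistic judgment does not guarantee.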

1️⃣ One-sentence summary

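This paper proposes Tree-of-Writing (ToW), a new evaluation method that uses a tree structure to explicitly model the weights of multiple sub-features of writing quality, resolving the inconsistency of existing LLM-as-a-judge methods when evaluating long-form writing. Building on it, the authors construct HoWToBench, a Chinese writing benchmark covering 12 genres and 1302 instructions. Experiments show that ToW reaches a correlation of 0.93 with human ratings and is robust to textual disturbances.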
