arXiv submission date: 2025-12-21
📄 Abstract - Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.

Top-level tags: llm, model evaluation, natural language processing
Detailed tags: difficulty prediction, human-ai alignment, educational assessment, metacognition, item response theory

Alignment between large language models and human cognitive struggles in item difficulty prediction / Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction


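1️⃣ One-sentence summary

Through a large-scale empirical analysis, this study finds that large language models are systematically misaligned with humans' actual perception of difficulty when predicting item difficulty: models tend to converge on a "machine consensus" rather than align with human cognition, their strong problem-solving ability can actually impede accurate difficulty estimation, and they show fundamental limitations in metacognition and in simulating students at specific proficiency levels.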


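2️⃣ Contributions of the paper

1. Analysis framework for Human-AI Difficulty Alignment

2. Multi-dimensional analysis framework

3. Formalization of the Item Difficulty Prediction (IDP) task and dual-perspective evaluation

4. IRT-based quantification of the capability-perception gap

5. Two-metric evaluation framework for Human-AI Difficulty Alignment

6. Cognitive-state simulation based on proficiency profiles

7. Discovery and analysis of systematic misalignment

8. Revealing the systematic deviation of machine consensus from human reality

9. Analysis of the limitations of ensemble and simulation methods

10. Cognitive divergence and the curse of knowledge

11. Metacognitive blindness

12. Metacognitive evaluation framework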
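
To make items 4 and 5 in the list above more concrete, here is a minimal Python sketch of two ingredients such an analysis could use: the one-parameter (Rasch) IRT model, and a rank correlation between human-derived and model-estimated difficulties. This page does not state which IRT variant or which two alignment metrics the paper uses, so the choice of the Rasch model and of Spearman correlation is an assumption made purely for illustration. The function names and the toy numbers are hypothetical.

```python
import math


def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) IRT: probability that a student with the given ability
    answers an item of the given difficulty correctly.
    Assumption: the paper may use a different IRT variant."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of ranks.
    Inputs must have equal length and must not be constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    human_difficulty = [-1.2, -0.3, 0.4, 1.1, 2.0]   # e.g. IRT difficulty from student responses
    model_estimate = [0.10, 0.20, 0.15, 0.60, 0.50]  # e.g. an LLM's predicted difficulty scores
    print("alignment (Spearman):", spearman(human_difficulty, model_estimate))
    print("P(correct | ability=0, difficulty=1.1):", rasch_p_correct(0.0, 1.1))
```

A rank correlation only checks whether the model orders items the way student data does. It says nothing about whether the absolute difficulty values agree, which is one reason an evaluation might pair it with a second metric.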


3️⃣ Main results and value

Result highlights

Practical value


4️⃣ Glossary

Source: arXiv: 2512.18880