GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

📄 Abstract - GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for pedagogical ability assessment in student-tutor dialogues. Building on the BEA 2025 TutorMind setting, we evaluate 120 configurations across five language models, zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multitask formulations. Gemma3-12B performs best for single-task evaluation, while Gemma3-27B in 8-bit precision is more reliable for multitask prediction. We find that augmentation helps models that struggle with the original data, verification adds limited gains despite higher cost, and CoT+Reasoning is more useful for synthetic data generation than direct classification. We further show that LoRA fine-tuning on structured classification objectives interferes with instruction-following behavior under thinking mode, redirecting generation away from the required evaluation format. Carbon analysis shows that model choice and reasoning mode substantially affect emissions. Overall, GRADE shows that carefully selected open-source LoRA pipelines can match or surpass proprietary and ensemble-based systems on key pedagogical dimensions, with code and data available at this https URL.


1️⃣ One-Sentence Summary

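This paper presents GRADE, a systematic study of how open-source models can evaluate tutoring dialogues the way a human tutor would. It finds that, with carefully optimized fine-tuning strategies, small open-source models can match or even surpass closed-source models on pedagogical dimensions such as identifying mistakes and providing guidance, while substantially reducing compute cost and carbon emissions.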
