BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

📄 Abstract

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning. We construct 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types, drawn from three existing Bengali datasets, and evaluate seven LLMs spanning reasoning-oriented, multilingual, and Bengali-centric categories under a dual-track protocol that independently measures false-positive rate on ground-truth instances (Track A) and hallucination detection rate on hallucinated candidates (Track B). To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial variation in hallucination calibration. Chain-of-thought prompting, applied as a mitigation strategy, shifts response distributions without consistently improving hallucination discrimination. BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings. The dataset and code are available at this https URL.
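The dual-track idea can be illustrated with a minimal sketch. It assumes each model verdict is recorded as a boolean ("flagged as hallucinated") and computes the Track A false-positive rate and the Track B detection rate separately. The abstract does not give the BenHalluScore formula, so the combined number below is Youden's J (detection rate minus false-positive rate), a standard statistic used here only as a stand-in to show why a model that flags everything should not score well. All names (`TrackRates`, `track_rates`, `youden_j`) and the toy verdicts are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TrackRates:
    false_positive_rate: float  # Track A: ground-truth instances wrongly flagged
    detection_rate: float       # Track B: hallucinated candidates correctly flagged


def track_rates(
    flags_on_ground_truth: Sequence[bool],
    flags_on_hallucinated: Sequence[bool],
) -> TrackRates:
    """Compute both track rates from per-instance 'flagged as hallucinated' verdicts."""
    if not flags_on_ground_truth or not flags_on_hallucinated:
        raise ValueError("both tracks need at least one verdict")
    fpr = sum(flags_on_ground_truth) / len(flags_on_ground_truth)
    detection = sum(flags_on_hallucinated) / len(flags_on_hallucinated)
    return TrackRates(false_positive_rate=fpr, detection_rate=detection)


def youden_j(rates: TrackRates) -> float:
    """Youden's J statistic; NOT BenHalluScore, only an illustrative joint score."""
    return rates.detection_rate - rates.false_positive_rate


# Toy verdicts (hypothetical): a model that flags every input as hallucinated.
always_flags = track_rates([True] * 4, [True] * 4)

# Toy verdicts (hypothetical): a model that discriminates reasonably well.
discriminating = track_rates(
    [False, False, True, False],  # one false positive on ground truth
    [True, True, True, False],    # three of four hallucinations caught
)

print(always_flags, youden_j(always_flags))
print(discriminating, youden_j(discriminating))
```

Looking at Track B alone, the always-flagging model appears perfect (detection rate 1.0); adding Track A exposes its false-positive rate of 1.0, and the joint statistic drops to zero. This is the kind of uniform response bias the abstract says a single-track evaluation would miss.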


1️⃣ One-Sentence Summary

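This paper introduces BenHalluEval, the first hallucination evaluation framework dedicated to Bengali. Using a benchmark of 12,000 hallucinated samples and a dual-track calibration metric, BenHalluScore, it systematically evaluates seven mainstream large language models across four tasks and finds that single-track evaluation and approaches relying solely on chain-of-thought prompting perform poorly in low-resource language settings.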
