arXiv submission date: 2026-02-16
📄 Abstract - BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

Information retrieval (IR) in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators raises concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we evaluate meaning preservation and task validity between source and translated datasets. Our experiments reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and the limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.
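
The abstract names majority agreement among multiple LLM annotators as one step of the framework but does not specify the rule. The following is a minimal sketch of one plausible reading, assuming each annotator emits a single categorical relevance label per query-document pair. The function name `majority_label`, the `min_agreement` threshold, the label strings, and the sample votes are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Require strictly more than `min_agreement` of the votes to agree.
    return label if count / len(votes) > min_agreement else None


# Hypothetical relevance votes from three LLM annotators (illustrative data only).
votes_by_pair = {
    ("q1", "d1"): ["relevant", "relevant", "not_relevant"],
    ("q1", "d2"): ["relevant", "not_relevant", "partially_relevant"],
}

labels = {pair: majority_label(v) for pair, v in votes_by_pair.items()}

# Pairs with no majority label are collected separately.
unresolved = [pair for pair, label in labels.items() if label is None]
```

In this sketch, pairs left in `unresolved` could be passed to human reviewers. That routing is an assumption for illustration, not a step the abstract describes.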

Top-level tags: llm natural language processing data
Detailed tags: information retrieval low-resource languages dataset construction machine translation evaluation

BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR


1️⃣ One-Sentence Summary

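This study proposes the BETA framework, which combines multiple large language models for annotation and verification, to construct information retrieval datasets for low-resource languages. It also shows that reusing datasets across languages through machine translation carries risks such as inconsistent semantic preservation and language-dependent bias.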

Source: arXiv: 2602.14488