DunbaaBERT: From Sacrifice to Semantics
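1️⃣ One-Sentence Summary
This paper introduces DunbaaBERT, a family of Urdu-specific pretrained models trained on a 17GB corpus with different vocabulary sizes. It shows that a carefully designed language-specific model can match strong multilingual models on multiple tasks, with better efficiency, even when using a smaller vocabulary and fewer resources.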
Large language models have achieved strong performance across many NLP tasks, yet Urdu remains comparatively underexplored due to limited resources and fragmented evaluation settings. To address this gap, we introduce DunbaaBERT, a family of Urdu RoBERTa-base models trained from scratch with Byte-BPE vocabularies of 32k, 52k, and 96k tokens on a deduplicated 17GB Urdu corpus. We evaluate DunbaaBERT across intrinsic and downstream Urdu NLP benchmarks covering linguistic acceptability, news classification, offensive language detection, and sentiment analysis while analyzing vocabulary-size effects on performance and efficiency trade-offs. Across benchmarks, the DunbaaBERT variants achieve competitive performance against strong multilingual baselines while consistently maintaining favorable efficiency trade-offs. Interestingly, larger vocabularies do not consistently improve downstream effectiveness, with DunbaaBERT$_{\text{32k}}$ repeatedly providing the strongest overall efficiency profile. Overall, our results demonstrate that carefully curated Urdu-specific encoder models can remain highly competitive despite comparatively compact model and training scales. All models are released under the MIT license.
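One concrete way vocabulary size affects model footprint is the token-embedding matrix, whose size grows linearly with the vocabulary. The sketch below is only an illustration, not a calculation taken from the paper. It assumes the standard RoBERTa-base hidden size of 768 and reads "32k", "52k", and "96k" as 32,000, 52,000, and 96,000 tokens; the exact vocabulary sizes used by the authors may differ.

```python
# Illustrative sketch only. Assumptions (not stated in the abstract):
#   - hidden size 768, the standard RoBERTa-base width
#   - "32k/52k/96k" interpreted as 32,000 / 52,000 / 96,000 tokens
HIDDEN_SIZE = 768


def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Return the number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


# Hypothetical mapping from variant label to vocabulary size.
VOCAB_SIZES = {"32k": 32_000, "52k": 52_000, "96k": 96_000}

for name, vocab_size in VOCAB_SIZES.items():
    millions = embedding_params(vocab_size) / 1e6
    print(f"{name}: {millions:.1f}M token-embedding parameters")
```

Under these assumptions the 96k variant carries three times as many embedding weights as the 32k variant, while the transformer layers are unchanged. This count ignores the opposite effect, where a larger vocabulary tends to produce shorter token sequences, so it should be read as one side of the trade-off rather than a full efficiency estimate.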
Source: arXiv: 2605.26935