arXiv submission date: 2026-02-17
📄 Abstract - Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.

Top-level tags: llm, natural language processing, data
Detailed tags: low-resource languages, lemmatization, pos-tagging, historical languages, few-shot learning

Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac


1️⃣ One-sentence summary

This paper finds that large language models such as GPT-4, even without task-specific training, can effectively perform automatic POS-tagging and lemmatization for several ancient, morphologically complex languages (such as Ancient Greek and Classical Armenian) in data-scarce settings, offering a new tool for the digital study of these languages.

Source: arXiv: 2602.15753