arXiv submission date: 2026-03-05
📄 Abstract - PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset (this https URL) and model (this https URL) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
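The token-level sequence labeling formulation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each word is tagged with the punctuation mark that follows it, or `O` for none. The label inventory and the helper names `text_to_labels` and `labels_to_text` are hypothetical, chosen for illustration; the abstract does not specify the authors' tag set or preprocessing.

```python
# Hypothetical label set: the paper's actual tag inventory is not given in the abstract.
PUNCT_TO_LABEL = {
    "،": "COMMA",        # Persian comma
    ".": "PERIOD",
    "؟": "QUESTION",     # Persian question mark
    "!": "EXCLAMATION",
    "؛": "SEMICOLON",    # Persian semicolon
    ":": "COLON",
}
LABEL_TO_PUNCT = {label: mark for mark, label in PUNCT_TO_LABEL.items()}
NO_PUNCT = "O"  # label for "no punctuation after this word"


def text_to_labels(text):
    """Turn punctuated text into parallel lists of bare words and per-word labels.

    Each label names the first punctuation mark that follows the word.
    """
    words, labels = [], []
    for raw in text.split():
        # Find where the trailing run of punctuation marks begins.
        end = len(raw)
        while end > 0 and raw[end - 1] in PUNCT_TO_LABEL:
            end -= 1
        word, trailing = raw[:end], raw[end:]
        if not word:
            # Stand-alone punctuation: attach it to the previous word if unlabeled.
            if words and labels[-1] == NO_PUNCT:
                labels[-1] = PUNCT_TO_LABEL[trailing[0]]
            continue
        words.append(word)
        labels.append(PUNCT_TO_LABEL[trailing[0]] if trailing else NO_PUNCT)
    return words, labels


def labels_to_text(words, labels):
    """Rebuild punctuated text from words and predicted labels."""
    return " ".join(w + LABEL_TO_PUNCT.get(l, "") for w, l in zip(words, labels))


if __name__ == "__main__":
    words, labels = text_to_labels("سلام، حال شما چطور است؟")
    for word, label in zip(words, labels):
        print(word, label)
    print(labels_to_text(words, labels))
```

Under this framing, a model such as a fine-tuned BERT encoder only predicts one label per word, so the input words are never rewritten. That is one plausible reason a tagger avoids the over-correction the abstract attributes to large language models.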

Top-level tags: natural language processing model training data
Detailed tags: punctuation restoration persian nlp sequence labeling low-resource languages bert fine-tuning

PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration


1️⃣ One-Sentence Summary

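This paper builds a large-scale, high-quality dataset for Persian punctuation restoration and proposes an efficient BERT-based model that outperforms large language models on this task, avoiding their tendency to over-edit the text and their high computational cost, and offering a practical approach for Persian and other low-resource languages.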

Source: arXiv: 2603.05314