arXiv submission date: 2026-05-07
📄 Abstract - Linear Semantic Segmentation for Low-Resource Spoken Dialects

Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource spoken varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard segmentation approaches. In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse. The benchmark covers transcribed casual telephone conversations, code-switched podcasts, broadcast news, and expressive dialogue from novels, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models performing well on MSA news genres degrade on dialectal transcribed speech. We further propose a segmentation model that targets local semantic coherence and robustness to discourse discontinuities, consistently outperforming strong baselines on dialectal non-news genres. The benchmark and approach generalize to other low-resource spoken languages.

Top-level tags: natural language processing, benchmark, low-resource
Detailed tags: semantic segmentation, dialectal arabic, discourse analysis, spoken language, low-resource

Linear Semantic Segmentation for Low-Resource Spoken Dialects


1️⃣ One-sentence summary

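To address the performance drop of existing semantic segmentation models on low-resource spoken dialects such as dialectal Arabic, the paper builds a multi-genre benchmark of more than 1000 samples (covering casual telephone conversations, code-switched podcasts, broadcast news, and dialogue from novels) and proposes a segmentation model that targets local semantic coherence and robustness to discourse discontinuities; the model consistently outperforms strong baselines on dialectal non-news genres, and the authors state that the benchmark and approach generalize to other low-resource spoken languages.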
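To make the idea of linear segmentation by local semantic coherence concrete, here is a minimal sketch in Python. It is not the paper's model: it is a generic TextTiling-style baseline that places a boundary wherever the similarity between adjacent utterances falls below a threshold. The function names (`bag_of_words`, `cosine`, `segment`), the bag-of-words representation, and the `threshold` value are all assumptions chosen for illustration; a realistic system would likely swap in a sentence encoder suited to dialectal text.

```python
import math
import re
from collections import Counter


def bag_of_words(utterance):
    """Lowercased word counts; a hypothetical stand-in for a real sentence encoder."""
    return Counter(re.findall(r"\w+", utterance.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(utterances, threshold=0.1):
    """Group consecutive utterances, starting a new segment whenever the
    similarity to the previous utterance drops below `threshold`
    (an illustrative value, not one taken from the paper)."""
    if not utterances:
        return []
    vectors = [bag_of_words(u) for u in utterances]
    segments = [[utterances[0]]]
    for prev, cur, text in zip(vectors, vectors[1:], utterances[1:]):
        if cosine(prev, cur) < threshold:
            segments.append([text])      # coherence break: open a new segment
        else:
            segments[-1].append(text)    # still coherent: extend the current one
    return segments


if __name__ == "__main__":
    demo = [
        "the match was great last night",
        "yes the match was close",
        "my flight leaves tomorrow morning",
        "which airport is the flight from",
    ]
    for i, seg in enumerate(segment(demo), start=1):
        print(i, seg)
```

Comparing only adjacent utterances keeps the output linear (a flat sequence of contiguous segments), but it also makes this baseline sensitive to exactly the discontinuities the abstract mentions, such as an off-topic interjection or a code-switched aside, which would trigger a spurious boundary here.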

Source: arXiv: 2605.06276