arXiv submission date: 2026-03-05
📄 Abstract - MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection

Urdu toxic span detection remains limited because most existing systems rely on sentence-level classification and fail to identify the specific toxic spans within the text. The problem is further exacerbated by several factors: the lack of token-level annotated resources, the linguistic complexity of Urdu, frequent code-switching, informal expressions, and rich morphological variation. In this research, we propose MUTEX, an Urdu toxic span detection framework that combines a multilingual transformer with conditional random fields (CRF) and uses a manually annotated token-level toxic span dataset to improve performance and interpretability. MUTEX uses XLM-RoBERTa with a CRF layer to perform sequence labeling and is tested on multi-domain data extracted from social media, online news, and YouTube reviews, using token-level F1 to evaluate fine-grained span detection. The results indicate that MUTEX achieves a token-level F1 score of 60%, establishing the first supervised baseline for Urdu toxic span detection. Further examination reveals that transformer-based models are more effective than other models at implicitly capturing contextual toxicity and are better able to handle code-switching and morphological variation.
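
The abstract names token-level F1 as the evaluation metric without spelling out how it is computed. The sketch below shows one common reading, assuming binary per-token labels (1 = toxic, 0 = non-toxic) and micro-averaging over the whole corpus. The function name `token_f1`, the label encoding, and the averaging scheme are assumptions made for illustration; the paper may compute the metric differently (for example, averaged per sentence).

```python
def token_f1(gold, pred):
    """Micro-averaged token-level precision, recall and F1.

    gold, pred: parallel lists of label sequences, one sequence per sentence,
    where each label is 1 (toxic token) or 0 (non-toxic token).
    This encoding is an assumption, not taken from the paper.
    """
    tp = fp = fn = 0
    for g_seq, p_seq in zip(gold, pred):
        if len(g_seq) != len(p_seq):
            raise ValueError("gold and predicted sequences must align token by token")
        for g, p in zip(g_seq, p_seq):
            if g == 1 and p == 1:
                tp += 1          # toxic token correctly flagged
            elif g == 0 and p == 1:
                fp += 1          # clean token wrongly flagged
            elif g == 1 and p == 0:
                fn += 1          # toxic token missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example with two sentences (made-up labels, not data from the paper).
gold = [[0, 1, 1, 0], [0, 0, 1]]
pred = [[0, 1, 0, 0], [0, 1, 1]]
precision, recall, f1 = token_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Unlike sentence-level accuracy, this score only rewards a system for flagging the right tokens, which is why it suits fine-grained span detection.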

Top-level tags: natural language processing, llm
Detailed tags: toxic span detection, sequence labeling, multilingual transformers, conditional random fields, urdu nlp

MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection


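1️⃣ One-sentence summary

This study proposes MUTEX, a new model combining a multilingual Transformer with conditional random fields, which establishes the first supervised baseline system for Urdu that precisely identifies the toxic word spans within a sentence (rather than only judging the whole sentence) and effectively addresses the detection challenges arising from the language's complex morphology, code-mixing, and related factors.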
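
To illustrate what a CRF layer can add on top of per-token transformer scores, here is a minimal sketch of Viterbi decoding for a linear-chain CRF. It assumes a BIO tag set (O, B, I), hand-picked scores, and a large negative penalty on the invalid O→I transition. None of these details come from the paper; they are placeholders chosen to show how sequence-level decoding can differ from picking the best tag for each token independently.

```python
def viterbi(emissions, transitions, start, end):
    """Return the highest-scoring tag path under a linear-chain CRF.

    emissions[t][k]   : score of tag k at token t (e.g. from an encoder)
    transitions[i][j] : score of moving from tag i to tag j
    start[k], end[k]  : scores for starting / ending the sequence in tag k
    """
    n_tags = len(start)
    score = [start[k] + emissions[0][k] for k in range(n_tags)]
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_tags):
            # Best previous tag for landing in tag j at position t.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # Pick the best final tag, then follow back-pointers to recover the path.
    last = max(range(n_tags), key=lambda k: score[k] + end[k])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


TAGS = ["O", "B", "I"]          # assumed BIO scheme, not confirmed by the paper
BAD = -1e4                      # penalty that effectively forbids a transition
transitions = [[0.0, 0.0, BAD],  # O -> I is invalid under BIO
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
start = [0.0, 0.0, BAD]         # a sequence cannot start with I
end = [0.0, 0.0, 0.0]
emissions = [[2.0, 0.0, 0.0],   # made-up per-token scores for three tokens
             [1.0, 0.0, 0.9],
             [0.0, 0.0, 2.0]]

greedy = [max(range(3), key=lambda k: row[k]) for row in emissions]
decoded = viterbi(emissions, transitions, start, end)
print("greedy :", [TAGS[k] for k in greedy])
print("viterbi:", [TAGS[k] for k in decoded])
```

In this toy case, independent per-token argmax yields O, O, I, which is not a valid BIO sequence, while Viterbi decoding returns O, B, I by taking the transition scores into account.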

Source: arXiv: 2603.05057