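CANDLE: Character-level Arabic Noise Deduplication using Lightweight Encoder
1️⃣ One-sentence summary
This paper presents CANDLE, a lightweight system that uses a Connectionist Temporal Classification (CTC) model to automatically identify and remove characters in Arabic text that are repeated out of social media habit, without relying on handcrafted rules or dictionaries. Distilling the 6-layer model into a 2-layer student gives a 3x depth reduction with minimal performance loss, and normalization lowers tokenizer fertility across Arabic LLM tokenizers by up to 12.8% in relative terms.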
Handling repeated characters in text can be tricky, since they can represent either the correct spelling of a word or informal character elongation often seen in social media posts. We present CANDLE, a lightweight system for character-level Arabic noise deduplication that addresses this challenge without relying on handcrafted rules, dictionaries, or morphological analyzers. At the heart of CANDLE is a novel application of Connectionist Temporal Classification (CTC) to this task, a formulation not previously explored for character deduplication, which frames normalization as a sequence alignment problem over a character-based encoder. Evaluated on three benchmarks spanning clean newspaper, manually curated ambiguous cases, and real-world social media text, the CTC model achieves a Sentence Error Rate (SER) as low as $5.37\%$ and consistently outperforms a classification-based baseline by a large margin. To reduce inference overhead, we distill the 6-layer CTC model into a 2-layer student, achieving a $3\times$ depth reduction with minimal performance degradation. Beyond deduplication accuracy, normalization yields a practical downstream benefit: a relative reduction in tokenizer fertility of up to $12.8\%$ across a diverse set of Arabic LLM tokenizers, directly lowering inference costs and improving context window utilization. We release all code and models publicly to support reproducibility and advance future research\footnote{this https URL}.
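The abstract names three quantities that are easy to misread: the CTC decoding rule, Sentence Error Rate, and tokenizer fertility. The sketch below illustrates each with the standard textbook definitions; it is not the authors' implementation. The blank symbol `_`, the function names, the Latin-letter toy input, and the whitespace-based word count used for fertility are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real CTC models reserve a vocabulary index


def ctc_collapse(frame_labels):
    """Standard CTC decoding rule: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


def sentence_error_rate(predictions, references):
    """Fraction of sentences whose prediction is not an exact match."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    errors = sum(p != r for p, r in zip(predictions, references))
    return errors / len(references)


def fertility(sentences, tokenize):
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, e.g. 0.128 for a 12.8% reduction."""
    return (before - after) / before


# A blank between two identical labels keeps a legitimate double letter,
# while an unbroken run of the same label collapses to a single character.
print(ctc_collapse(list("cooo_ol")))  # cool
print(ctc_collapse(list("sooo")))     # so
```

Under this reading, a per-character model can express both outcomes through one mechanism: emitting a blank between repeats preserves a correct doubled letter, and omitting it removes an elongation. Whether CANDLE decodes greedily in exactly this way is not stated in the abstract.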
Source: arXiv: 2606.24758