CANDLE: Character-level Arabic Noise Deduplication using Lightweight Encoder

📄 Abstract

Repeated characters in text are hard to handle because they can represent either the correct spelling of a word or the informal character elongation common in social media posts. We present CANDLE, a lightweight system for character-level Arabic noise deduplication that addresses this challenge without relying on handcrafted rules, dictionaries, or morphological analyzers. At the heart of CANDLE is a novel application of Connectionist Temporal Classification (CTC) to this task, a formulation not previously explored for character deduplication, which frames normalization as a sequence alignment problem over a character-based encoder. Evaluated on three benchmarks spanning clean newspaper text, manually curated ambiguous cases, and real-world social media text, the CTC model achieves a Sentence Error Rate (SER) as low as $5.37\%$ and consistently outperforms a classification-based baseline by a large margin. To reduce inference overhead, we distill the 6-layer CTC model into a 2-layer student, a $3\times$ depth reduction with minimal performance degradation. Beyond deduplication accuracy, normalization yields a practical downstream benefit: a relative reduction in tokenizer fertility of up to $12.8\%$ across a diverse set of Arabic LLM tokenizers, which directly lowers inference costs and improves context window utilization. We release all code and models publicly to support reproducibility and advance future research\footnote{this https URL}.
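The two metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes the usual definitions: SER as the fraction of sentences whose output differs from the reference in any way, and fertility as the average number of tokens per whitespace-delimited word. The abstract does not spell out either definition, so both are assumptions here. All function names are hypothetical, and `toy_tokenize` is a stand-in that cuts words into fixed-size chunks; it is not one of the Arabic LLM tokenizers evaluated in the paper.

```python
def sentence_error_rate(predictions, references):
    """Fraction of sentences whose prediction is not an exact match (assumed SER definition)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one sentence")
    errors = sum(p != r for p, r in zip(predictions, references))
    return errors / len(references)


def toy_tokenize(text, max_len=4):
    """Hypothetical stand-in for a subword tokenizer: split each word into chunks of max_len characters."""
    return [w[i:i + max_len] for w in text.split() for i in range(0, len(w), max_len)]


def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word (assumed fertility definition)."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_tokens / n_words


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, e.g. 0.128 for a 12.8% reduction."""
    return (before - after) / before


# Elongated input, built programmatically so the repeat counts are explicit.
noisy = ["جم" + "ي" * 6 + "ل" + " " + "جد" + "ا" * 5]
clean = ["جميل جدا"]

fert_noisy = fertility(noisy, toy_tokenize)
fert_clean = fertility(clean, toy_tokenize)
print(fert_noisy, fert_clean, relative_reduction(fert_noisy, fert_clean))
print(sentence_error_rate(["a", "b", "c", "d"], ["a", "b", "x", "d"]))
```

With a real subword tokenizer the reduction would depend on its vocabulary; the toy chunker only shows why shorter, normalized words tend to produce fewer tokens.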


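1️⃣ One-sentence summary

This paper presents CANDLE, a lightweight system that uses a Connectionist Temporal Classification (CTC) model to automatically detect and remove the repeated characters that social media writing habits introduce into Arabic text, without handcrafted rules or dictionaries; distillation compresses the 6-layer model into a 2-layer student (a 3× depth reduction), and normalization lowers the fertility of Arabic LLM tokenizers by up to 12.8% in relative terms.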
