A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments

📄 Abstract - A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments

This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms by combining cosine similarity with n-gram similarity. This allows spelling and morphological diversity to be analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, yet it produces transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
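The grouping step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the equal weighting (`alpha`), the threshold, the trigram Jaccard measure, the single-link grouping, and all function names are assumptions made for illustration. The toy vectors and word forms are hand-made and do not come from the paper's corpus. In practice the vectors would presumably come from a subword embedding model trained on the raw comments.

```python
import math
from itertools import combinations


def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams (an assumed surface-similarity measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def combined_similarity(a, b, vectors, alpha=0.5):
    """Weighted mix of embedding cosine and n-gram overlap (alpha is an assumption)."""
    return alpha * cosine(vectors[a], vectors[b]) + (1 - alpha) * ngram_similarity(a, b)


def variant_families(vectors, threshold=0.6, alpha=0.5):
    """Group words whose combined similarity meets the threshold (single-link, union-find)."""
    parent = {w: w for w in vectors}

    def find(w):
        # Follow parent links to the root, halving the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(vectors, 2):
        if combined_similarity(a, b, vectors, alpha) >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for w in vectors:
        families.setdefault(find(w), set()).add(w)
    return list(families.values())


# Hand-made toy vectors standing in for trained subword embeddings (hypothetical values).
TOY_VECTORS = {
    "villmools": [0.90, 0.10, 0.00],
    "villmols": [0.85, 0.15, 0.05],
    "merci": [0.10, 0.90, 0.20],
}

if __name__ == "__main__":
    for family in variant_families(TOY_VECTORS):
        print(sorted(family))
```

Requiring both signals to contribute keeps distributionally similar but unrelated words apart, and likewise keeps apart look-alike forms that occur in different contexts. The weighting and threshold would need tuning against real data.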


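1️⃣ One-sentence summary

This paper proposes a subword embedding method that requires neither prior normalisation nor a variant word list. By analysing spelling and morphological variation in raw text, it reveals systematic linguistic variation in Luxembourgish user comments and offers a reproducible framework for studying linguistic diversity in multilingual and small-language contexts.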
