arXiv submission date: 2026-02-12
📄 Abstract - A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments

This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
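The grouping step described in the abstract (combining cosine similarity over embeddings with n-gram similarity over surface forms) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's procedure: the toy three-dimensional vectors stand in for trained subword embeddings, the word forms are illustrative, character-trigram Jaccard overlap is one possible n-gram similarity, and the weight `alpha`, the `threshold`, and the union-find grouping are hypothetical choices.

```python
import math
from itertools import combinations


def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-gram sets (an assumed surface-form measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def combined_similarity(a, b, vectors, alpha=0.5):
    """Weighted mix of embedding cosine and n-gram overlap (alpha is hypothetical)."""
    return alpha * cosine(vectors[a], vectors[b]) + (1 - alpha) * ngram_similarity(a, b)


def induce_families(vectors, threshold=0.6, alpha=0.5):
    """Link every pair of forms whose combined score reaches the threshold,
    then return the connected components as sorted lists of forms."""
    parent = {w: w for w in vectors}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(vectors, 2):
        if combined_similarity(a, b, vectors, alpha) >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for w in vectors:
        families.setdefault(find(w), set()).add(w)
    return sorted((sorted(f) for f in families.values()), key=lambda f: f[0])


# Made-up vectors for illustration only; real ones would come from a
# subword embedding model trained on the raw comment corpus.
toy_vectors = {
    "vläicht": [0.90, 0.10, 0.00],
    "vlaicht": [0.85, 0.15, 0.05],
    "villäicht": [0.88, 0.12, 0.02],
    "haus": [0.00, 0.20, 0.95],
}

families = induce_families(toy_vectors)
print(families)
```

With these toy inputs the three similar spellings end up in one family and the unrelated form stays on its own. Mixing the two scores means forms must be close both distributionally and on the surface, which is one way to keep semantically related but orthographically unrelated words out of a variant family.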

Top-level tags: natural language processing, machine learning, data
Detailed tags: subword embeddings, variation detection, low-resource languages, linguistic analysis, corpus linguistics

A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments


1️⃣ One-sentence summary

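This paper proposes a subword embedding method that requires neither prior normalisation nor a variant list. By analysing spelling and morphological variation in raw text, it reveals systematic linguistic variation in Luxembourgish user comments and offers a reproducible framework for studying language diversity in multilingual and small-language contexts.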

Source: arXiv: 2602.11795