arXiv submission date: 2026-03-16
📄 Abstract - Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies

Large language models (LLMs) have shown remarkable performance across various sentence-based linguistic phenomena, yet their ability to capture cross-sentence paradigmatic patterns, such as verb alternations, remains underexplored. In this work, we present curated paradigm-based datasets for four languages, designed to probe systematic cross-sentence knowledge of verb alternations (change-of-state and object-drop constructions in English, German and Italian, and Hebrew binyanim). The datasets comprise thousands of Blackbird Language Matrices (BLM) problems. The BLM task, an RPM/ARC-like task devised specifically for language, is a controlled linguistic puzzle in which models must select the sentence that completes a pattern according to syntactic and semantic rules. We introduce three types of templates varying in complexity and apply linguistically informed data augmentation strategies across synthetic and natural data. We provide simple baseline performance results across English, Italian, German, and Hebrew that demonstrate the diagnostic usefulness of the datasets.
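
To make the task format concrete, here is a minimal Python sketch of a BLM-style multiple-choice problem and an accuracy score. It is an illustration only, not the authors' data format or code: the `BLMProblem` class, its field names, the `accuracy` helper, and the toy English sentences are all hypothetical, invented to mirror the abstract's description of selecting the sentence that completes a pattern.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class BLMProblem:
    """Hypothetical container for one BLM-style problem (not the paper's schema)."""
    context: Sequence[str]      # sentences that establish the pattern
    candidates: Sequence[str]   # answer options; exactly one completes the pattern
    answer_index: int           # position of the correct option in `candidates`


def accuracy(problems: Sequence[BLMProblem],
             choose: Callable[[BLMProblem], int]) -> float:
    """Fraction of problems for which `choose` returns the correct index."""
    if not problems:
        return 0.0
    correct = sum(1 for p in problems if choose(p) == p.answer_index)
    return correct / len(problems)


# Toy change-of-state item; the sentences are invented for illustration.
toy = BLMProblem(
    context=(
        "The chef melted the butter.",
        "The butter melted.",
        "The child broke the vase.",
    ),
    candidates=(
        "The vase broke.",            # intransitive variant with the patient as subject
        "The child broke.",           # wrong argument promoted to subject
        "The vase broke the child.",  # arguments swapped
    ),
    answer_index=0,
)


def first_candidate(problem: BLMProblem) -> int:
    """Trivial stand-in for a model: always pick the first option."""
    return 0


def last_candidate(problem: BLMProblem) -> int:
    """Trivial stand-in for a model: always pick the last option."""
    return len(problem.candidates) - 1


if __name__ == "__main__":
    print(accuracy([toy], first_candidate))
    print(accuracy([toy], last_candidate))
```

In a real evaluation, `choose` would wrap a model's prediction; the two fixed-choice functions above stand in only so the sketch runs on its own.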

Top-level tags: llm, natural language processing, data
Detailed tags: verb alternations, linguistic datasets, cross-lingual evaluation, paradigm learning, data augmentation

Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies


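1️⃣ One-sentence summary

This paper builds datasets for English, German, Italian, and Hebrew that are designed specifically to test how well large language models understand the patterns by which verbs alternate between different uses, and it provides the methods used to construct the data along with preliminary test results.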

Source: arXiv: 2603.15295