arXiv submission date: 2026-02-18
📄 Abstract - Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification

This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions coming from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases, examining the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.
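The teacher-student idea described in the abstract can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: `teacher_label` is a hypothetical keyword lookup standing in for the LLM annotator, `StudentClassifier` is a toy naive Bayes model standing in for the fine-tuned multilingual encoder, and the speeches and the two topic names are invented. It is not the authors' pipeline, only the shape of it: the teacher produces labels for unlabeled in-domain text, and the student is trained on those labels.

```python
from collections import Counter, defaultdict
import math


def teacher_label(speech: str) -> str:
    """Hypothetical stand-in for the teacher LLM: a simple keyword lookup."""
    text = speech.lower()
    if "hospital" in text or "health" in text:
        return "Health"
    if "school" in text or "teacher" in text:
        return "Education"
    return "Other"


class StudentClassifier:
    """Toy multinomial naive Bayes, standing in for the fine-tuned encoder."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter(labels)      # label -> number of speeches
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log(n / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented, unlabeled in-domain speeches (illustrative only).
unlabeled = [
    "we must fund the hospital wards",
    "public health nurses need support",
    "every school needs more books",
    "teacher salaries in each school are too low",
]

# Step 1: the teacher annotates the in-domain data.
silver_labels = [teacher_label(s) for s in unlabeled]

# Step 2: the student is trained on the teacher's annotations.
student = StudentClassifier().fit(unlabeled, silver_labels)

# Step 3: the cheap student annotates new speeches at scale.
prediction = student.predict("more nurses for the wards")
print(silver_labels, prediction)
```

In this sketch the student is the component that would be run over the full corpus, while the comparatively expensive teacher is only called on the training sample; that division of labor is the cost argument the abstract makes.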

Top-level tags: llm, natural language processing, data
Detailed tags: dataset, multilingual classification, political text analysis, teacher-student framework, policy topic classification

Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification


1️⃣ One-sentence summary

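This paper presents ParlaCAP, a large-scale dataset of European parliamentary speeches, and develops an efficient, low-cost method that uses a large language model to annotate data automatically and train a classifier specialized in parliamentary policy topics, helping researchers compare the focus of attention and the sentiment of parliaments across countries.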

Source: arXiv: 2602.16516