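Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches
1️⃣ One-sentence summary
The paper builds a multimodal, multilingual dataset of speech texts, images, and metadata from several decades of speeches by senior Russian government officials. Unique identifiers link the records and an expert validated the annotations, making the dataset a resource for analyzing authoritarian political communication and for social science and large language model applications.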
This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent gaps in the availability of text- and image-based social data for authoritarian political contexts. The dataset comprises two large corpora of official speeches delivered over multiple decades by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs. For each speech, we provide Russian- and English-language texts, associated images and captions where available, and harmonized metadata such as dates, speakers, (geo)locations, and official government content tags. Unique identifiers link images to speeches and align the Russian and English versions of the same text. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The resulting data resources support multimodal, multilingual, temporal, and spatial analyses of (authoritarian) political communication and offer a valuable testbed for social science research and large language model (LLM) applications in political domains.
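The linking scheme described above (one identifier shared by a speech's Russian text, its English text, and its images) can be illustrated with a minimal sketch. The field names `speech_id`, `lang`, `image_id`, and `caption`, the helper `link_records`, and all sample records are hypothetical and chosen only for illustration; the dataset's actual schema may differ.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative assumptions,
# not the dataset's actual schema or contents.
speeches = [
    {"speech_id": "s001", "lang": "ru", "text": "Текст выступления"},
    {"speech_id": "s001", "lang": "en", "text": "Speech text"},
    {"speech_id": "s002", "lang": "ru", "text": "Другой текст"},
]
images = [
    {"image_id": "i001", "speech_id": "s001", "caption": "At the meeting"},
    {"image_id": "i002", "speech_id": "s001", "caption": "Signing ceremony"},
]


def link_records(speeches, images):
    """Group language versions and images under their shared speech identifier."""
    linked = defaultdict(lambda: {"texts": {}, "images": []})
    for speech in speeches:
        # One entry per language, keyed by the shared identifier.
        linked[speech["speech_id"]]["texts"][speech["lang"]] = speech["text"]
    for image in images:
        # Images attach to a speech through the same identifier.
        linked[image["speech_id"]]["images"].append(image["image_id"])
    return dict(linked)


linked = link_records(speeches, images)
```

In this sketch, `linked["s001"]` holds both language versions and both image identifiers, while `linked["s002"]` holds only a Russian text and an empty image list, mirroring the "where available" caveat for images and captions.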
Source: arXiv: 2605.15886