Build, Borrow, or Just Fine-Tune? A Political Scientist's Guide to Choosing NLP Models
1️⃣ One-Sentence Summary
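By comparing a fine-tuned general-purpose model against a domain-specific model on conflict event classification, this paper offers political scientists a practical decision framework, showing that model choice depends on how common the event categories in the task are, how much error the research can tolerate, and what resources are available, rather than on which model performs better in the abstract.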
Political scientists increasingly face a consequential choice when adopting natural language processing tools: build a domain-specific model from scratch, borrow and adapt an existing one, or simply fine-tune a general-purpose model on task data? Each approach occupies a different point on the spectrum of performance, cost, and required expertise, yet the discipline has offered little empirical guidance on how to navigate this trade-off. This paper provides such guidance. Using conflict event classification as a test case, I fine-tune ModernBERT on the Global Terrorism Database (GTD) to create Confli-mBERT and systematically compare it against ConfliBERT, a domain-specific pretrained model that represents the current gold standard. Confli-mBERT achieves 75.46% accuracy compared to ConfliBERT's 79.34%. Critically, the four-percentage-point gap is not uniform: on high-frequency attack types such as Bombing/Explosion (F1 = 0.95 vs. 0.96) and Kidnapping (F1 = 0.92 vs. 0.91), the models are nearly indistinguishable. Performance differences concentrate in rare event categories comprising fewer than 2% of all incidents. I use these findings to develop a practical decision framework for political scientists considering any NLP-assisted research task: when does the research question demand a specialized model, and when does an accessible fine-tuned alternative suffice? The answer, I argue, depends not on which model is "better" in the abstract, but on the specific intersection of class prevalence, error tolerance, and available resources. The model, training code, and data are publicly available on Hugging Face.
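The abstract's central observation is that an aggregate accuracy gap can hide where the errors actually fall. A minimal sketch of that check follows, assuming nothing about the paper's own evaluation code: the helper names `per_class_f1` and `accuracy` are hypothetical, and the labels and counts are invented toy data, not GTD records or results from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def accuracy(y_true, y_pred):
    """Share of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): two common classes and one rare class (2% of cases).
y_true = ["Bombing/Explosion"] * 90 + ["Kidnapping"] * 8 + ["RareCategory"] * 2
# A hypothetical model that never predicts the rare class.
y_pred = ["Bombing/Explosion"] * 90 + ["Kidnapping"] * 8 + ["Bombing/Explosion"] * 2

overall = accuracy(y_true, y_pred)
scores = per_class_f1(y_true, y_pred)

print(f"accuracy = {overall:.2f}")
for label, f1 in scores.items():
    print(f"{label:<20} F1 = {f1:.2f}")
```

In this toy setup the overall accuracy stays high while the rare class scores an F1 of zero. That is why a per-class breakdown, rather than a single headline number, is the more useful input when weighing class prevalence against error tolerance.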
Source: arXiv: 2603.09595