Model in Distress: Sentiment Analysis on French Synthetic Social Media
1️⃣ One-sentence summary
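This paper proposes a generalizable synthetic data generation method that uses backtranslation and related techniques to produce a large volume of French social media text from only a small seed corpus, and uses that data to train models that accurately detect user dissatisfaction, while addressing high annotation costs, the scarcity of multilingual data, and user privacy protection.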
Automated analysis of customer feedback on social media is hindered by three challenges: the high cost of annotated training data, the scarcity of evaluation sets, especially in multilingual settings, and privacy concerns that prevent data sharing and reproducibility. We address these issues by developing a generalizable synthetic data generation pipeline applied to a case study on customer distress detection in French public transportation. Our approach utilizes backtranslation with fine-tuned models to generate 1.7 million synthetic tweets from a small seed corpus, complemented by synthetic reasoning traces. We train 600M-parameter reasoners with English and French reasoning that achieve 77-79% accuracy on human-annotated evaluation data, matching or exceeding SOTA proprietary LLMs and specialized encoders. Beyond reducing annotation costs, our pipeline preserves privacy by eliminating the exposure of sensitive user data. Our methodology can be adopted for other use cases and languages.
Source: arXiv: 2604.18226