DanQing (丹青): An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset
1️⃣ One-Sentence Summary
To address the scarcity of high-quality Chinese image-text data, this paper constructs DanQing (丹青), a Chinese image-text dataset of 100 million high-quality and up-to-date pairs (drawn mainly from 2024-2025), and shows experimentally that models trained on it perform better across a variety of Chinese downstream tasks.
Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pre-training. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pre-training has substantially lagged behind, due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, which contains 100 million image-text pairs collected from Common Crawl. Unlike existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is primarily built from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continual pre-training of the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.
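The contrastive pre-training objective mentioned in the abstract (the CLIP family of models) pairs each image with its caption and pushes matched pairs together while pushing mismatched pairs apart. Below is a minimal NumPy sketch of the symmetric InfoNCE loss used by CLIP-style models; the function name and the temperature value are illustrative assumptions, not details from the paper's training setup.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: arrays of shape (batch, dim); row i of each is a
    matched image-text pair. Temperature 0.07 is an illustrative default.
    """
    # L2-normalize so the dot product becomes cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by the temperature
    logits = img @ txt.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # matched pairs lie on the diagonal

    def xent(l):
        # Row-wise softmax cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), labels].mean()

    # Average the image-to-text and text-to-image directions
    return (xent(logits) + xent(logits.T)) / 2
```

Note that SigLIP (also named in the abstract) replaces this batch-wise softmax with an independent pairwise sigmoid loss, which avoids the need for a global normalization over the batch; the sketch above shows only the original CLIP formulation.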
Source: arXiv:2601.10305