arXiv submission date: 2025-12-30
📄 Abstract - Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.
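The abstract's claim of matching dedicated expert models with under 5% of their task-specific data rests on lightweight fine-tuning of the pretrained foundation model. The adaptation recipe is not spelled out here, so the snippet below is only a minimal, hypothetical PyTorch sketch of one common lightweight strategy: LoRA-style low-rank adapters trained on a small subset of task data. Every class name, function name, hyperparameter, and the classification objective is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of lightweight adapter fine-tuning on a small task-specific
# subset, illustrating the "<5% of task-specific data" adaptation idea.
# Model, dataset, and hyperparameters are placeholders, not the IMDD-1M pipeline.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank (LoRA-style) update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)               # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def wrap_linears(module: nn.Module, rank: int = 8) -> None:
    """Recursively replace nn.Linear sublayers with LoRA-wrapped versions."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            wrap_linears(child, rank=rank)


def finetune_on_subset(model: nn.Module, dataset, fraction: float = 0.05,
                       epochs: int = 3, lr: float = 1e-4) -> nn.Module:
    """Train only the adapter parameters on a small fraction of the task data."""
    wrap_linears(model)
    subset = Subset(dataset, range(max(1, int(len(dataset) * fraction))))
    loader = DataLoader(subset, batch_size=16, shuffle=True)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    loss_fn = nn.CrossEntropyLoss()                      # assumes a classification head
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```

Only the low-rank adapter weights receive gradients, so both the number of trainable parameters and the amount of task-specific data stay small; this is the general shape of the data-efficient adaptation the abstract describes, not its specific method.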

Top-level tags: computer vision, multi-modal data
Detailed tags: industrial defect detection, vision-language model, dataset, diffusion model, domain adaptation

Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset


1️⃣ One-Sentence Summary

This paper builds IMDD-1M, a large-scale industrial defect dataset of one million image-text pairs, and trains a general-purpose vision-language foundation model on it. With only a small amount of fine-tuning data, the model reaches expert-model performance across a range of industrial inspection tasks, offering a new route to efficient, scalable quality inspection for intelligent manufacturing.

Source: arXiv: 2512.24160