FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
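1️⃣ One-sentence summary
This paper introduces FORGE, an evaluation framework that builds a finely annotated dataset of real-world 2D images and 3D point clouds and uses it to evaluate multimodal large language models on manufacturing tasks. It finds that the main bottleneck is not visual understanding but a lack of domain-specific knowledge, and shows that fine-tuning on the dataset significantly improves model accuracy in manufacturing scenarios.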
The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at this https URL.
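The abstract reports a relative, not absolute, improvement in accuracy but does not spell out the formula. A minimal sketch, assuming the usual definition (gain divided by baseline accuracy), is shown below. The function name and the accuracy values are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline_acc: float, finetuned_acc: float) -> float:
    """Relative accuracy gain of a fine-tuned model over its baseline.

    Assumed definition: (finetuned - baseline) / baseline.
    """
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (finetuned_acc - baseline_acc) / baseline_acc


# Hypothetical accuracies for illustration only, NOT results from the paper.
baseline, finetuned = 0.25, 0.50
gain = relative_improvement(baseline, finetuned)
print(f"absolute gain: {finetuned - baseline:.2f}, relative gain: {gain:.1%}")
```

Under this definition a large relative gain can correspond to a modest absolute gain when the baseline accuracy is low, so the two figures are best read together.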
Source: arXiv: 2604.07413