arXiv submission date: 2026-01-29
📄 Abstract - Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores

Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained and standardized benchmark spanning single-image, multi-image, and video inputs, making it challenging to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7, and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.
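The abstract describes QUAD as a multi-stage pipeline that turns noisy, heterogeneous FSRS data into a controllable training corpus. The paper's actual stages are not detailed here, but the general pattern of chaining quality-aware filters can be sketched as follows (all names, stages, and thresholds are hypothetical illustrations, not the authors' implementation):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """A hypothetical instruction sample with a model-assigned quality score."""
    text: str
    quality: float  # assumed to lie in [0, 1]

def dedup_stage(samples):
    # Drop exact-text duplicates, keeping the first occurrence.
    seen, out = set(), []
    for s in samples:
        if s.text not in seen:
            seen.add(s.text)
            out.append(s)
    return out

def quality_stage(samples, threshold=0.5):
    # Keep only samples whose quality score clears the threshold.
    return [s for s in samples if s.quality >= threshold]

def curate(samples, stages):
    # Apply each stage in order, yielding a progressively cleaner corpus.
    for stage in stages:
        samples = stage(samples)
    return samples
```

For example, `curate(data, [dedup_stage, quality_stage])` would first deduplicate and then filter by score; an auditable, closed-loop pipeline would additionally log what each stage removed and why.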

Top-level tags: multi-modal, model training, benchmark
Detailed tags: multimodal LLM, domain-specific, data curation, retail, evaluation benchmark

Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores


1️⃣ One-Sentence Summary

This paper presents Ostrakon-VL, a vision-language model designed specifically for food-service and retail store scenarios. Through a novel data-curation method and the first public benchmark for this industry, it outperforms substantially larger models while keeping a modest parameter count, enabling more reliable understanding of the complex visual information found in store environments.

Source: arXiv 2601.21342