Abstract - VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-Rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting an early assessment of whether a model is feasible for a given task. It also identifies problematic regions of the embedding space and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 scores without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
1️⃣ One-Sentence Summary
This paper proposes a method called VERSE, which analyzes and visualizes the internal representations of vision-language models to locate the regions where the model tends to fail, and then generates targeted synthetic data to augment training. This substantially improves performance on visually-rich document understanding tasks, even allowing on-premise models to rival cloud-based commercial models.
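The clustering-guided workflow described in the abstract (reduce the visual embedding space, cluster it, and rank clusters by error rate to find the regions worth reinforcing with synthetic data) can be sketched as follows. This is a minimal illustration with NumPy only, assuming per-document embeddings and per-document error flags are already available; the function names, the choice of PCA and k-means, and all data here are illustrative, not taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def reduce_pca(X, k=2):
    """Project embeddings onto their top-k principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data yields the principal directions in Vt.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(X, n_clusters=3, n_iter=50):
    """Plain k-means; returns one cluster label per row of X."""
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every center -> nearest assignment.
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Synthetic stand-ins: 300 document embeddings of dimension 64,
# plus a boolean flag marking documents the model got wrong.
emb = rng.normal(size=(300, 64))
errors = rng.random(300) < 0.2

Z = reduce_pca(emb, k=2)          # reduced space, also usable for plotting
labels = kmeans(Z, n_clusters=3)  # clusters of the reduced embeddings

# Rank clusters by error rate: high-error clusters are the
# "problematic regions" to target with extra synthetic samples.
for c in range(3):
    mask = labels == c
    print(f"cluster {c}: {mask.sum():3d} docs, "
          f"error rate {errors[mask].mean():.2f}")
```

In practice the reduction and clustering steps would run on embeddings extracted from the model under study, and the high-error clusters would then be inspected for shared visual features before generating new training documents with those features.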