菜单

🤖 系统
📄 Abstract - TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-sourced, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: this https URL

顶级标签: computer vision natural language processing multi-modal
详细标签: table recognition self-supervised learning vision-language models fine-tuning document parsing 或 搜索:

TRivia:用于表格识别的视觉语言模型自监督微调方法 / TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition


1️⃣ 一句话总结

这篇论文提出了一种名为TRivia的自监督微调方法,让视觉语言模型无需人工标注数据,就能直接从大量无标签表格图片中学习识别和结构化表格,并基于此训练出了一个性能超越现有先进系统的开源模型TRivia-3B。


📄 打开原文 PDF