A Hybrid Vision Transformer Approach for Mathematical Expression Recognition
1️⃣ One-sentence summary
This paper proposes a hybrid Vision Transformer model with 2D positional encoding. By tracking attention history with an improved decoder, it effectively addresses the difficulties that the two-dimensional structure and varying symbol sizes pose for mathematical expression recognition, and it surpasses current state-of-the-art methods on a public dataset.
Mathematical expression recognition is one of the crucial challenges in document analysis. Unlike text recognition, which deals only with images of one-dimensional structure, mathematical expression recognition is a much harder problem because of its two-dimensional structure and varying symbol sizes. In this paper, we propose a Hybrid Vision Transformer (HVT) with 2D positional encoding as the encoder to extract the complex relationships between symbols in the image. A coverage attention decoder tracks the attention history to mitigate the under-parsing and over-parsing problems. We also show the benefit of using the [CLS] token of ViT as the initial embedding of the decoder. Experiments on the IM2LATEX-100K dataset demonstrate the effectiveness of our method, achieving a BLEU score of 89.94 and outperforming current state-of-the-art methods.
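To make the 2D positional encoding idea concrete, here is a minimal sketch of the common split-channel scheme: half of the embedding channels carry a sinusoidal encoding of the row index and the other half of the column index, so the encoder can distinguish vertical from horizontal layout (e.g. superscripts vs. adjacent symbols). This is an illustration of the general technique, not the paper's actual code; all function names are ours.

```python
import numpy as np

def sinusoidal_1d(positions, dim):
    """Standard 1D sinusoidal encoding (sin on even channels, cos on odd)."""
    pe = np.zeros((len(positions), dim))
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))
    pe[:, 0::2] = np.sin(positions[:, None] * div)
    pe[:, 1::2] = np.cos(positions[:, None] * div)
    return pe

def positional_encoding_2d(height, width, d_model):
    """2D encoding: first half of channels encodes the row, second half the column."""
    assert d_model % 4 == 0, "d_model must be divisible by 4"
    half = d_model // 2
    rows = sinusoidal_1d(np.arange(height), half)  # (H, half)
    cols = sinusoidal_1d(np.arange(width), half)   # (W, half)
    pe = np.zeros((height, width, d_model))
    pe[:, :, :half] = rows[:, None, :]  # broadcast row code across columns
    pe[:, :, half:] = cols[None, :, :]  # broadcast column code across rows
    return pe.reshape(height * width, d_model)  # flatten grid to a token sequence

pe = positional_encoding_2d(8, 16, 64)
print(pe.shape)  # (128, 64): one 64-dim code per feature-map cell
```

The encoding is added to the flattened CNN feature map before the Transformer encoder, so self-attention can relate symbols both horizontally and vertically.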
From arXiv: 2603.07929