arXiv submission date: 2026-06-23
📄 Abstract - Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods, typically built around regular scene text and fixed-template inputs, struggle to scale to WATER. We therefore aim to advance this task from both the data and the model perspective. On the data side, we construct a 2M-sample synthetic dataset, WATER-S, which is hundreds of times larger than existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining and Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder supporting arbitrary-shaped inputs and an autoregressive decoder to model complex layouts, structurally breaking the bottleneck of fixed-template STR on WordArt. Experiments show that this architecture outperforms prior STR methods, achieving state-of-the-art performance on irregular texts such as WordArt. Together with WATER-R, carefully reorganized from existing real STR data, our strong baseline with the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at this https URL.

Top-level tags: computer vision, data, machine learning
Detailed tags: scene text recognition, wordart, autoregressive decoder, synthetic dataset, vision-language model

Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods


1️⃣ One-sentence summary

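This paper addresses the challenges that the highly customized styles of WordArt pose for conventional scene text recognition, working from both the data and the model side: it constructs WATER-S, a large-scale synthetic dataset of 2 million samples, and proposes WATERec, a recognition model that supports arbitrary-shaped inputs and autoregressive decoding. The resulting system reaches 90.40% accuracy on the WordArt recognition benchmark, far surpassing existing general-purpose and OCR-specialized models.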

Source: arXiv: 2606.24484