Self-supervised Pre-training of Text Recognizers

2405.00420

Published 5/2/2024 by Martin Kišš, Michal Hradiš

Abstract

In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but it is costly to annotate them. Therefore, methods utilizing unlabeled data are researched. We study self-supervised pre-training methods based on masked label prediction using three different approaches -- Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent model collapse where it relies solely on positional encoding while completely ignoring the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets mainly to investigate the benefits of the self-supervised pre-training techniques with different amounts of annotated target domain data. We use transfer learning as strong baselines. The evaluation shows that the self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is one of the first researches exploring self-supervised pre-training in document text recognition, and we believe that it will become a cornerstone for future research in this area. We made our implementation of the investigated methods publicly available at https://github.com/DCGM/pero-pretraining.

Overview

This paper investigates self-supervised pre-training for document text recognition, aiming to reduce the amount of annotated data a recognizer needs.
The researchers study two families of methods: masked label prediction (with labels produced by Feature Quantization, VQ-VAE, or Post-Quantized AE) and joint-embedding objectives (VICReg and NT-Xent), for which they propose an image shifting technique that keeps the model from relying solely on positional encoding.
The methods are evaluated on historical handwritten (Bentham) and historical printed datasets with different amounts of annotated target-domain data. Pre-training on target-domain data is very effective, but it struggles to outperform transfer learning from closely related domains.

Plain English Explanation

Text recognition models are computer systems that can automatically extract and interpret text from images or documents. These models are commonly used in applications like document digitization, license plate reading, and handwriting analysis. However, training these models typically requires large labeled datasets, which can be time-consuming and expensive to obtain.

The researchers in this paper study how to pre-train text recognition models with self-supervised learning. Self-supervised learning is a type of machine learning where the model learns useful representations from the data itself, without the need for manual labeling. In this case, the model is first pre-trained on unlabeled document images and then trained on whatever annotated data is available for the target domain.

One family of methods trains the model to predict labels for parts of the input that have been masked out. The labels are derived from the unlabeled images themselves, using Feature Quantization, VQ-VAE, or Post-Quantized AE. A second family, joint-embedding, trains the model with the VICReg and NT-Xent objectives. For this family the researchers add an image shifting technique, because otherwise the model can collapse: it relies solely on positional encoding and ignores the input image.
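
As a rough illustration of masked label prediction, the sketch below assigns each feature vector a discrete label by nearest-centroid lookup (a simple stand-in for a quantization step), masks random spans, and collects the labels a model would have to predict at the masked positions. The function names, the toy features and centroids, and the span-masking scheme are assumptions made for illustration, not the paper's implementation.

```python
import random


def quantize(features, centroids):
    """Assign each feature vector the index of its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(f, centroids[k]))
        for f in features
    ]


def sample_mask(length, mask_prob, span, rng):
    """Start a masked span of `span` positions at each index with probability `mask_prob`."""
    mask = [False] * length
    for start in range(length):
        if rng.random() < mask_prob:
            for i in range(start, min(start + span, length)):
                mask[i] = True
    return mask


def masked_targets(labels, mask):
    """Return (position, label) pairs for the masked positions only."""
    return [(i, lab) for i, (lab, m) in enumerate(zip(labels, mask)) if m]


# Toy sequence of 2-D features and a hypothetical two-entry codebook.
features = [[0.1, 0.0], [0.9, 1.1], [0.0, 0.2], [1.0, 0.9], [0.2, 0.1], [1.1, 1.0]]
centroids = [[0.0, 0.0], [1.0, 1.0]]

labels = quantize(features, centroids)
mask = sample_mask(len(labels), mask_prob=0.3, span=2, rng=random.Random(0))
targets = masked_targets(labels, mask)

print("labels :", labels)
print("mask   :", mask)
print("targets:", targets)
```

During pre-training, a loss such as cross-entropy would be computed only at the positions listed in `targets`, so the model has to infer the hidden labels from the unmasked context.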

The researchers evaluate these methods on historical handwritten and historical printed documents and find that self-supervised pre-training on data from the target domain is very effective. However, it struggles to outperform transfer learning from closely related domains, which the paper uses as a strong baseline. Self-supervised pre-training is therefore most relevant when labeled data is scarce.

Technical Explanation

The paper studies two families of self-supervised pre-training methods for text recognition models:

Masked Label Prediction: The model is trained to predict labels for masked parts of the input. Three ways of producing the labels are compared: Feature Quantization, VQ-VAE, and Post-Quantized AE.
Joint-Embedding Learning: The model is trained with the VICReg and NT-Xent objectives. For these, the researchers propose an image shifting technique that prevents a collapse in which the model relies solely on positional encoding while completely ignoring the input image.
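
For the contrastive objective, the abstract names NT-Xent. A minimal pure-Python sketch of that loss follows, assuming two lists of embeddings in which `view_a[i]` and `view_b[i]` come from two views of the same input. The function names, the toy embeddings, and the temperature value are illustrative choices rather than the authors' settings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent loss averaged over all 2N anchors of N positive pairs."""
    n = len(view_a)
    z = view_a + view_b  # 2N embeddings; the partner of index i is (i + n) mod 2N
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)
        positive = cosine(z[i], z[j]) / temperature
        # The denominator sums over every other embedding, positive included.
        others = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        total += math.log(sum(math.exp(s) for s in others)) - positive
    return total / (2 * n)


# Toy check: correctly matched pairs versus the same embeddings paired wrongly.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.1], [0.1, 1.0]]
swapped = [[0.1, 1.0], [1.0, 0.1]]

loss_matched = nt_xent(anchors, matched)
loss_swapped = nt_xent(anchors, swapped)
print(loss_matched < loss_swapped)
```

The matched pairing yields the lower loss, which is the behaviour the objective rewards: embeddings of two views of the same input are pulled together while all other embeddings in the batch are pushed apart.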

The researchers evaluate the methods on historical handwritten (Bentham) and historical printed datasets, varying the amount of annotated target-domain data, and use transfer learning as a strong baseline. The results show that self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains.

Critical Analysis

The paper presents a useful comparison of self-supervised pre-training methods for text recognition. Its key strengths include learning from unlabeled target-domain data, evaluation against strong transfer-learning baselines, and a publicly available implementation at https://github.com/DCGM/pero-pretraining.

However, the paper does not address some potential limitations of the approach. For instance, it is unclear how the framework would perform on more diverse or challenging text recognition tasks, such as those involving complex layouts, multi-lingual text, or degraded image quality. Additionally, the researchers do not provide a detailed analysis of the model's interpretability or the representations it learns during the pre-training stage.

Further research could explore the generalizability of the self-supervised pre-training approach to other text recognition tasks and datasets, as well as investigate ways to improve the interpretability and transparency of the learned representations.

Conclusion

This paper investigates self-supervised pre-training for document text recognition, comparing masked label prediction and joint-embedding methods. Pre-training on unlabeled data from the target domain proves very effective, although it struggles to outperform transfer learning from closely related domains.

Self-supervised pre-training has the potential to benefit text recognition systems in domains where labeled data is scarce. By reducing the reliance on manual labeling, it can make text recognition more accessible and cost-effective, with applications in areas like document digitization, assistive technology, and autonomous systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging

Gabriel Meseguer-Brocal, Dorian Desblancs, Romain Hennequin

Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data. It is particularly compelling in the music domain, where obtaining labeled data is time-consuming, error-prone, and ambiguous. During the self-supervised process, models are trained on pretext tasks, with the primary objective of acquiring robust and informative features that can later be fine-tuned for specific downstream tasks. The choice of the pretext task is critical as it guides the model to shape the feature space with meaningful constraints for information encoding. In the context of music, most works have relied on contrastive learning or masking techniques. In this study, we expand the scope of pretext tasks applied to music by investigating and comparing the performance of new self-supervised methods for music tagging. We open-source a simple ResNet model trained on a diverse catalog of millions of tracks. Our results demonstrate that, although most of these pre-training methods result in similar downstream results, contrastive learning consistently results in better downstream performance compared to other self-supervised pre-training methods. This holds true in a limited-data downstream context.

4/16/2024

cs.SD cs.LG eess.AS

Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

Zuan Gao, Yuxin Wang, Yadong Qu, Boqiang Zhang, Zixiao Wang, Jianjun Xu, Hongtao Xie

In text recognition, self-supervised pre-training emerges as a good solution to reduce dependence on expansive annotated real data. Previous studies primarily focus on local visual representation by leveraging mask image modeling or sequence contrastive learning. However, they omit modeling the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct the direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image with its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the feature of the same original image and inverted image with different augmentations to model the semantic-level linguistic context and the local character discrimination. In our design, we disrupt the character shape and linguistic rules. Consequently, the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and 86.6% new state-of-the-art average word accuracy on Union14M benchmarks. The code is available at https://github.com/FaltingsA/SSM.

5/14/2024

cs.CV

Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation

Dingwen Zhang, Hao Li, Diqi He, Nian Liu, Lechao Cheng, Jingdong Wang, Junwei Han

In recent times, following the paradigm of DETR (DEtection TRansformer), query-based end-to-end instance segmentation (QEIS) methods have exhibited superior performance compared to CNN-based models, particularly when trained on large-scale datasets. Nevertheless, the effectiveness of these QEIS methods diminishes significantly when confronted with limited training data. This limitation arises from their reliance on substantial data volumes to effectively train the pivotal queries/kernels that are essential for acquiring localization and shape priors. To address this problem, we propose a novel method for unsupervised pre-training in low-data regimes. Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts (UPLVP), which improves QEIS models' instance segmentation by bringing language-vision prompts to queries/kernels. Our method consists of three parts: (1) Masks Proposal: Utilizes language-vision models to generate pseudo masks based on unlabeled images. (2) Prompt-Kernel Matching: Converts pseudo masks into prompts and injects the best-matched localization and shape features to their corresponding kernels. (3) Kernel Supervision: Formulates supervision for pre-training at the kernel level to ensure robust learning. With the help of our pre-training method, QEIS models can converge faster and perform better than CNN-based models in low-data regimes. Experimental evaluations conducted on MS COCO, Cityscapes, and CTW1500 datasets indicate that the QEIS models' performance can be significantly improved when pre-trained with our method. Code will be available at: https://github.com/lifuguan/UPLVP.

5/24/2024

cs.CV

How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining

Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang

Masked reconstruction, which predicts randomly masked patches from unmasked ones, has emerged as an important approach in self-supervised pretraining. However, the theoretical understanding of masked pretraining is rather limited, especially for the foundational architecture of transformers. In this paper, to the best of our knowledge, we provide the first end-to-end theoretical guarantee of learning one-layer transformers in masked reconstruction self-supervised pretraining. On the conceptual side, we posit a mechanism of how transformers trained with masked vision pretraining objectives produce empirically observed local and diverse attention patterns, on data distributions with spatial structures that highlight feature-position correlations. On the technical side, our end-to-end characterization of training dynamics in softmax-attention models simultaneously accounts for input and position embeddings, which is developed based on a careful analysis tracking the interplay between feature-wise and position-wise attention correlations.

6/6/2024

cs.LG stat.ML