Towards Unified Multi-granularity Text Detection with Interactive Attention

Read original: arXiv:2405.19765 - Published 5/31/2024 by Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang

Overview

This paper proposes Detect Any Text (DAT), a unified model that performs scene text detection, layout analysis, and document page detection end to end, covering text at four granularities: word, line, paragraph, and page.
The model relies on an across-granularity interactive attention module that correlates structural information across text queries, so that detection at each granularity benefits from the others.
The authors report state-of-the-art results on benchmarks for multi-oriented and arbitrarily-shaped scene text detection, document layout analysis, and page detection.

Plain English Explanation

The paper presents a text detection model that identifies text at several levels of granularity with a single model: individual words, lines, paragraphs, and whole pages. Existing OCR and document analysis systems typically train a separate model for each scenario and granularity, which is costly in computation and resources.

The key innovation of this model is an "interactive attention" mechanism, which lets the representations of text at different granularities inform one another. This is like a human reader who uses the layout of a paragraph to make sense of its lines and words, and the words to confirm where the paragraph begins and ends.

By incorporating this interactive attention, the model is reported to outperform other state-of-the-art approaches on benchmarks for scene text detection, document layout analysis, and page detection. This could enable more robust and versatile text-based applications, such as improved document analysis.

Technical Explanation

The proposed model, DAT, extracts visual features from the input image and represents candidate text instances as queries at each granularity (word, line, paragraph, and page). An across-granularity interactive attention module then relates these queries to one another, so that structural information learned at one granularity strengthens the representations at the others.

Alongside the interactive attention module, a prompt-based segmentation module refines the detections for text of arbitrary curvature and complex layouts. Together, the two components let a single end-to-end model handle tasks that would otherwise require separately trained detectors.
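
The idea of letting text representations at different granularities attend to one another can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head, no learned projections, and toy two-dimensional query vectors, and the function name `cross_granularity_attention` is hypothetical. It only shows how word, line, paragraph, and page queries could exchange information through plain scaled dot-product attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_granularity_attention(queries_by_granularity):
    """Let every text query attend to the queries of all granularities.

    queries_by_granularity maps a granularity name to a list of query
    vectors (lists of floats, all of the same dimension). Returns a dict
    of the same shape holding the updated queries.
    Assumption: one head, queries double as keys and values.
    """
    # Flatten all queries into one sequence, remembering their origin.
    names, flat = [], []
    for name, queries in queries_by_granularity.items():
        for q in queries:
            names.append(name)
            flat.append(q)

    dim = len(flat[0])
    scale = 1.0 / math.sqrt(dim)

    updated = {name: [] for name in queries_by_granularity}
    for name, q in zip(names, flat):
        # Scaled dot-product score against every query, across granularities.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in flat]
        weights = softmax(scores)
        # Each updated query is a weighted average of all query vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, flat)) for i in range(dim)]
        updated[name].append(mixed)
    return updated


# Toy queries; the values are made up for illustration.
queries = {
    "word": [[1.0, 0.0], [0.9, 0.1]],
    "line": [[0.8, 0.2]],
    "paragraph": [[0.0, 1.0]],
    "page": [[0.5, 0.5]],
}
refined = cross_granularity_attention(queries)
```

In this sketch each refined query is a weighted average of all queries, so a word query ends up carrying some information from the line, paragraph, and page queries it resembles most. A real model would add learned projections, multiple heads, and attention to image features.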

The authors evaluate their approach on several standard text detection benchmarks, including ICDAR 2015, SCUT-CTW1500, and Total-Text. They report state-of-the-art performance on multi-oriented and arbitrarily-shaped scene text detection, as well as on document layout analysis and page detection.
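
Detection benchmarks of this kind are usually scored with precision, recall, and F-measure over matched boxes. The sketch below is a simplified illustration, not the official protocol of any of the datasets named above: it assumes axis-aligned boxes, an IoU threshold of 0.5, and greedy one-to-one matching, whereas the real benchmarks use their own evaluation scripts and often polygon annotations. The function names and the sample boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_scores(predictions, ground_truth, threshold=0.5):
    """Greedy one-to-one matching; returns (precision, recall, f_measure)."""
    matched = set()
    true_positives = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground-truth box can be matched only once
            overlap = iou(pred, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


# Made-up boxes: two good detections and one false positive.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(1, 1, 10, 10), (50, 50, 60, 60), (20, 20, 30, 31)]
precision, recall, f_measure = detection_scores(predictions, ground_truth)
```

With these sample boxes, two of the three predictions match a ground-truth box, so precision is 2/3, recall is 1.0, and the F-measure is 0.8.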

Critical Analysis

The proposed interactive attention mechanism is a promising approach to unifying text detection across granularities. By correlating structural information across text queries, the model lets detection at one granularity benefit detection at the others, which helps it handle the diverse range of text instances that appear in complex scenes and documents.

However, the paper does not provide a detailed analysis of the limitations of the approach. For example, it is unclear how the model would perform on extremely small or heavily occluded text, or how it might scale to processing high-resolution images. Additionally, the computational complexity of the interactive attention module is not discussed, which could be an important consideration for real-world applications.

Furthermore, the paper could have provided a more in-depth comparison to other recent advances in text detection and document analysis, to better situate the contributions of the proposed approach within the broader context of the field.

Conclusion

This paper presents a multi-granularity text detection model with an interactive attention mechanism that relates text instances across the word, line, paragraph, and page levels. The authors report state-of-the-art performance on several benchmark datasets, highlighting the potential of this approach for more robust and versatile text-based applications.

While the paper could benefit from a more thorough discussion of the model's limitations and further contextualization within the broader field, the proposed interactive attention mechanism represents an interesting and valuable contribution to the ongoing research in text detection and document analysis.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Towards Unified Multi-granularity Text Detection with Interactive Attention

Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang

Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands. In this paper, we introduce Detect Any Text (DAT), an advanced paradigm that seamlessly unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model. This design enables DAT to efficiently manage text instances at different granularities, including *word*, *line*, *paragraph* and *page*. A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances at varying granularities by correlating structural information across different text queries. As a result, it enables the model to achieve mutually beneficial detection performances across multiple text granularities. Additionally, a prompt-based segmentation module refines detection outcomes for texts of arbitrary curvature and complex layouts, thereby improving DAT's accuracy and expanding its real-world applicability. Experimental results demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks, including multi-oriented/arbitrarily-shaped scene text detection, document layout analysis and page detection tasks.

5/31/2024

Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis

Tianci Bi, Xiaoyi Zhang, Zhizheng Zhang, Wenxuan Xie, Cuiling Lan, Yan Lu, Nanning Zheng

Significant progress has been made in scene text detection models since the rise of deep learning, but scene text layout analysis, which aims to group detected text instances as paragraphs, has not kept pace. Previous works either treated text detection and grouping with separate models or trained a unified model from scratch. None of them has made full use of already well-trained text detectors and easily obtainable detection datasets. In this paper, we present Text Grouping Adapter (TGA), a module that can enable the utilization of various pre-trained text detectors to learn layout analysis, allowing us to adopt a well-trained text detector right off the shelf or just fine-tune it efficiently. Designed to be compatible with various text detector architectures, TGA takes detected text regions and image features as universal inputs to assemble text instance features. To capture broader contextual information for layout analysis, we propose to predict text group masks from text instance features by one-to-many assignment. Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating our TGA into various pre-trained text detectors and text spotters can achieve superior layout analysis performance, simultaneously inheriting generalized text detection ability from pre-training. In the case of full parameter fine-tuning, we can further improve layout analysis performance.

5/14/2024

Aggregated Text Transformer for Scene Text Detection

Zhao Zhou, Xiangcheng Du, Yingbin Zheng, Cheng Jin

This paper explores the multi-scale aggregation strategy for scene text detection in natural images. We present the Aggregated Text TRansformer (ATTR), which is designed to represent texts in scene images with a multi-scale self-attention mechanism. Starting from the image pyramid with multiple resolutions, the features are first extracted at different scales with shared weights and then fed into an encoder-decoder architecture of Transformer. The multi-scale image representations are robust and contain rich information on text contents of various sizes. The text Transformer aggregates these features to learn the interaction across different scales and improve text representation. The proposed method detects scene texts by representing each text instance as an individual binary mask, which is tolerant of curved texts and regions with dense instances. Extensive experiments on public scene text detection datasets demonstrate the effectiveness of the proposed framework.

6/5/2024

AnyTrans: Translate AnyText in the Image with Large Scale Models

Zhipeng Qian, Pei Zhang, Baosong Yang, Kai Fan, Yiwei Ma, Derek F. Wong, Xiaoshuai Sun, Rongrong Ji

This paper introduces AnyTrans, an all-encompassing framework for the task-Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, the advanced inpainting and editing abilities of diffusion models make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Additionally, our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the TATI task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.

6/18/2024