ClusterTabNet: Supervised clustering method for table detection and table structure recognition

Read original: arXiv:2402.07502 - Published 5/24/2024 by Marek Polewczyk, Marco Spinaci


Overview

The researchers present a novel deep learning-based method for clustering words in documents to detect and recognize tables from OCR output.
Their approach interprets table structure as a graph of relations between words, and uses a transformer encoder model to predict the adjacency matrix of this graph.
The method is evaluated on several datasets and achieves similar or better accuracy compared to state-of-the-art detection methods, while using a significantly smaller model.

Plain English Explanation

The paper describes a new way to automatically detect and understand the structure of tables in scanned documents. The key idea is to interpret the table as a graph - with words as the nodes and the relationships between them (like which words are in the same row or column) as the edges. A deep learning model is then used to predict this graph structure from the text in the document.

This graph-based approach allows the system to understand the overall layout and organization of the table, not just detect its boundaries. The researchers show that this method performs as well as or better than existing table detection techniques, while using a much smaller model. This could make the system more efficient and practical to deploy, for example in document digitization workflows.

Technical Explanation

The core of the researchers' approach is to model the table structure as a graph, where the words in the document are the nodes and the relationships between pairs of words (e.g. belonging to the same row, column, header, or table) are the edges. They use a transformer encoder model to predict the adjacency matrix of this graph from the OCR output of the document.
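To make the graph formulation concrete, here is a minimal sketch of how a "same row" or "same column" adjacency matrix could be built from word-level annotations. The annotation format (a `row` and `col` id per word) and the helper name `relation_matrix` are assumptions made for illustration; they are not taken from the paper.

```python
# Hypothetical annotation format: each OCR word carries a row id and a column id.
words = [
    {"text": "Item",  "row": 0, "col": 0},
    {"text": "Price", "row": 0, "col": 1},
    {"text": "Apple", "row": 1, "col": 0},
    {"text": "1.20",  "row": 1, "col": 1},
]

def relation_matrix(words, key):
    """Return an N x N 0/1 matrix; entry [i][j] is 1 when words i and j share `key`."""
    n = len(words)
    return [
        [1 if words[i][key] == words[j][key] else 0 for j in range(n)]
        for i in range(n)
    ]

# One adjacency matrix per relation type (illustrative; only two relations shown).
same_row = relation_matrix(words, "row")
same_col = relation_matrix(words, "col")

for line in same_row:
    print(line)
```

Under this assumed setup, a model would be trained to predict one such matrix per relation type, with each entry treated as a binary label for a pair of words.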

The transformer model takes the sequence of words as input and outputs a predicted adjacency matrix, which encodes the inferred table structure. This allows the system to not just locate the table boundaries, but understand the higher-level organization of the table contents.
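One plausible way to turn a predicted adjacency matrix into word clusters is to threshold the pairwise scores and take connected components. The sketch below does this with a small union-find. The threshold of 0.5, the averaging of the two directions, and the function name `clusters_from_adjacency` are illustrative assumptions; the paper's actual post-processing may differ.

```python
from typing import Dict, List

def clusters_from_adjacency(probs: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Group word indices into clusters via connected components of the thresholded graph."""
    n = len(probs)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: average both directions so the relation is symmetric.
            if (probs[i][j] + probs[j][i]) / 2 >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up "same row" scores for four words (not results from the paper).
predicted_same_row = [
    [1.0, 0.9, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.2, 0.1, 1.0, 0.7],
    [0.1, 0.2, 0.9, 1.0],
]
rows = clusters_from_adjacency(predicted_same_row)
print(rows)  # [[0, 1], [2, 3]]
```

Running the same decoding on the row, column, and table matrices would yield the word groups from which a table grid can be assembled.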

The researchers evaluate their method on several benchmark datasets for table detection and recognition, including PubTables-1M, PubTabNet, and FinTabNet. They compare the performance to state-of-the-art object detection methods like DETR and Faster R-CNN, and show that their graph-based approach achieves similar or better accuracy, while using a much smaller and more efficient model.

Critical Analysis

The paper makes a compelling case for the graph-based approach to table understanding, demonstrating strong performance on standard benchmarks. However, the researchers acknowledge some limitations in their analysis.

For example, the transformer model they use relies on full-page OCR, which may not be available in all real-world scenarios. Further research into more efficient OCR integration could help broaden the applicability of the method.

Additionally, the datasets used for evaluation may not fully capture the diversity of table structures found in real-world documents. Applying the method to a broader range of table types, including more complex layouts and formats, could uncover additional challenges or areas for improvement.

Overall, the graph-based table understanding technique represents a promising direction, but further research is needed to assess its robustness and generalizability in practical document digitization use cases.

Conclusion

The researchers have developed a novel deep learning-based approach to table detection and recognition that models the table structure as a graph. On benchmark datasets, this graph-based method achieves accuracy similar to or better than state-of-the-art detection methods such as DETR and Faster R-CNN, while using a significantly smaller model.

The key innovation is the use of a transformer encoder to predict the adjacency matrix of the table graph, allowing the system to capture the higher-level organization and layout of the table contents. This could have important applications in areas like document digitization and the extraction of tabular data.

While the results are promising, further research is needed to assess the robustness and generalizability of the approach. Nonetheless, this work represents an important step forward in the field of table understanding and document analysis.



Related Papers


ClusterTabNet: Supervised clustering method for table detection and table structure recognition

Marek Polewczyk, Marco Spinaci

We present a novel deep-learning-based method to cluster words in documents which we apply to detect and recognize tables given the OCR output. We interpret table structure bottom-up as a graph of relations between pairs of words (belonging to the same row, column, header, as well as to the same table) and use a transformer encoder model to predict its adjacency matrix. We demonstrate the performance of our method on the PubTables-1M dataset as well as PubTabNet and FinTabNet datasets. Compared to the current state-of-the-art detection methods such as DETR and Faster R-CNN, our method achieves similar or better accuracy, while requiring a significantly smaller model.

5/24/2024

UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition

Zhenrong Zhang, Shuhang Liu, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Yu Hu

In the digital era, table structure recognition technology is a critical tool for processing and analyzing large volumes of tabular data. Previous methods primarily focus on visual aspects of table structure recovery but often fail to effectively comprehend the textual semantics within tables, particularly for descriptive textual cells. In this paper, we introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model. UniTabNet employs a "divide-and-conquer" strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure. We further enhance our framework with the Vision Guider, which directs the model's focus towards pertinent areas, thereby boosting prediction accuracy. Additionally, we introduce the Language Guider to refine the model's capability to understand textual semantics in table images. Evaluated on prominent table structure datasets such as PubTabNet, PubTables1M, WTW, and iFLYTAB, UniTabNet achieves a new state-of-the-art performance, demonstrating the efficacy of our approach. The code will also be made publicly available.

9/23/2024


End-to-End Semi-Supervised approach with Modulated Object Queries for Table Detection in Documents

Iqraa Ehsan, Tahira Shehzadi, Didier Stricker, Muhammad Zeshan Afzal

Table detection, a pivotal task in document analysis, aims to precisely recognize and locate tables within document images. Although deep learning has shown remarkable progress in this realm, it typically requires an extensive dataset of labeled data for proficient training. Current CNN-based semi-supervised table detection approaches use the anchor generation process and Non-Maximum Suppression (NMS) in their detection process, limiting training efficiency. Meanwhile, transformer-based semi-supervised techniques adopted a one-to-one match strategy that provides noisy pseudo-labels, limiting overall efficiency. This study presents an innovative transformer-based semi-supervised table detector. It improves the quality of pseudo-labels through a novel matching strategy combining one-to-one and one-to-many assignment techniques. This approach significantly enhances training efficiency during the early stages, ensuring superior pseudo-labels for further training. Our semi-supervised approach is comprehensively evaluated on benchmark datasets, including PubLayNet, ICDAR-19, and TableBank. It achieves new state-of-the-art results, with a mAP of 95.7% and 97.9% on TableBank (word) and PubLayNet with 30% label data, marking a 7.4 and 7.6 point improvement over the previous semi-supervised table detection approach, respectively. The results clearly show the superiority of our semi-supervised approach, surpassing all existing state-of-the-art methods by substantial margins. This research represents a significant advancement in semi-supervised table detection methods, offering a more efficient and accurate solution for practical document analysis tasks.

5/14/2024

TC-OCR: TableCraft OCR for Efficient Detection & Recognition of Table Structure & Content

Avinash Anand, Raj Jaiswal, Pijush Bhuyan, Mohit Gupta, Siddhesh Bangar, Md. Modassir Imam, Rajiv Ratn Shah, Shin'ichi Satoh

The automatic recognition of tabular data in document images presents a significant challenge due to the diverse range of table styles and complex structures. Tables offer valuable content representation, enhancing the predictive capabilities of various systems such as search engines and Knowledge Graphs. Addressing the two main problems, namely table detection (TD) and table structure recognition (TSR), has traditionally been approached independently. In this research, we propose an end-to-end pipeline that integrates deep learning models, including DETR, CascadeTabNet, and PP OCR v2, to achieve comprehensive image-based table recognition. This integrated approach effectively handles diverse table styles, complex structures, and image distortions, resulting in improved accuracy and efficiency compared to existing methods like Table Transformers. Our system achieves simultaneous table detection (TD), table structure recognition (TSR), and table content recognition (TCR), preserving table structures and accurately extracting tabular data from document images. The integration of multiple models addresses the intricacies of table recognition, making our approach a promising solution for image-based table understanding, data extraction, and information retrieval applications. Our proposed approach achieves an IOU of 0.96 and an OCR Accuracy of 78%, showcasing a remarkable improvement of approximately 25% in the OCR Accuracy compared to the previous Table Transformer approach.

4/22/2024