RealKIE: Five Novel Datasets for Enterprise Key Information Extraction

Read original: arXiv:2403.20101 - Published 4/1/2024 by Benjamin Townsend, Madison May, Christopher Wells

Introduction

The NLP community has produced many benchmark datasets for information extraction tasks over the years. However, these datasets often lack the realism of the complicated information extraction tasks performed by knowledge workers in enterprise settings. The authors present RealKIE, a new benchmark with five document-level key information extraction datasets that aim to address several key difficulties:

Poor document quality leading to OCR artifacts and poor text serialization
Sparse annotations within long documents causing class imbalance issues
Complex tabular layouts that must be considered to discriminate between similar labels
Varied data types to be extracted, from simple dates and prices to long-form clauses

The benchmark includes PDF documents, full OCR output, and text span annotations. The extracted fields are representative of real-world data extraction tasks in domains like accounts payable invoice processing and legal contract analysis. The authors hope this new benchmark will drive research into novel approaches to information extraction in real-world enterprise settings.

Dataset Descriptions

This section summarizes the datasets that compose RealKIE. Each dataset is described, including details on the documents, example elements from the full sequence labeling schema, and summary statistics.

The SEC S1 Filings dataset consists of 322 labeled S1 filings from the SEC's EDGAR data store. These documents have high variability in content and formatting. The labeling schema is meant to mimic the activities of an investment analyst assessing an offering.

The US Non-Disclosure Agreements (NDA) dataset contains 439 non-disclosure agreements submitted to EDGAR. This dataset includes manually labeled text span annotations.

The UK Charity Reports dataset contains 538 public annual reports filed by charities in the UK. The schema extends the Kleister-Charities dataset and includes fields capturing information about charity activities and personnel.

The FCC Invoices dataset consists of 370 labeled invoices containing cost information from political campaign advertisements. These documents have a mixture of document-level, line-level, and summary information, presenting challenges in annotation and modeling.

The Resource Contracts dataset contains 198 labeled legal contracts for resource exploration and exploitation. These documents have varied formats, visual quality, and ways of presenting information, making consistent labeling difficult.

The document length statistics for each dataset are provided in Table 6.

Document Processing

Each document enters the pipeline as a PDF, is converted to images, and is processed by an optical character recognition (OCR) engine. All documents go through OCR for consistency, and the OCR files, images, and original files are all shared as part of the dataset. Duplicate text was removed.

Two different pipelines are used to process the documents. The OmniPage pipeline uses OmniPage for both OCR and PDF-to-PNG conversion. The Azure Read OCR Pipeline uses the Azure Computer Vision Read API for OCR and PyPDFium for PDF-to-PNG conversion. Both pipelines apply rotation and de-skewing based on the OCR engine outputs.

OmniPage was used for all datasets except Resource Contracts. OmniPage provides consistent OCR output for clean scans or native PDFs, while Azure's Read OCR handled the shading and partial occlusion in the Resource Contracts files better. This document processing workflow plays an important role in the dataset preparation process, establishing consistency for subsequent stages.

Description of Annotation Task

Most of the annotation process is shared across the RealKIE datasets.

Prior to annotation, a set of slides was created to detail annotation expectations, including describing the intent of each label, providing positive examples, and documenting counter-examples. During the annotation process, these slides were amended as needed for clarification.

The annotation was performed using a commercial annotation interface that provided a PDF-like UI for applying labels via a highlighting tool. This approach removed ambiguities that may have been introduced by OCR.

The annotation process consisted of three main phases:

Initial annotation: 5-10 documents were annotated by the person who developed the labeling guide, to test the guide and refine the schema.
Model-assisted annotation: After the first 50 documents, a token-classification model was automatically trained and used to provide predictions to the annotators, who could accept, reject, or turn off the predictions.
Quality review: A model was trained on the annotated dataset to identify disagreements between the model predictions and the annotations. These disagreements were then manually reviewed.
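The quality review phase can be illustrated with a minimal sketch. It assumes annotations and model predictions are both available as labeled character spans and that any span without an exact match on the other side counts as a disagreement. The `Span` type, the `find_disagreements` helper, and the label names are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


def find_disagreements(annotated, predicted):
    """Return (spans only the annotator marked, spans only the model predicted)."""
    annotated_set = set(annotated)
    predicted_set = set(predicted)
    only_annotated = sorted(annotated_set - predicted_set)
    only_predicted = sorted(predicted_set - annotated_set)
    return only_annotated, only_predicted


# Hypothetical labels for one NDA-like document.
annotated = [Span(0, 10, "party"), Span(25, 35, "effective_date")]
predicted = [Span(0, 10, "party"), Span(40, 52, "jurisdiction")]

only_annotated, only_predicted = find_disagreements(annotated, predicted)
```

Both output lists would go to a human reviewer: the first may contain annotation errors or model misses, the second may contain labels the annotator overlooked. A real review tool would likely also treat partially overlapping spans as a separate case.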

The paper notes that in an industry setting, each document is seen by only one annotator, so metrics like inter-annotator agreement are not available.

Baseline Procedure and Results

The paper describes the baseline models used for the RealKIE (Real-world Key Information Extraction) task. The authors finetuned several different pretrained transformer models using a token classification approach. The baseline models used are RoBERTa-base, DeBERTa-v3-base, XDoc-base, LayoutLM-v3-base, and Longformer-base.
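To make the token classification framing concrete, here is a minimal sketch of turning character-level span annotations into per-token labels. It assumes whitespace tokenization and an "O" label for unlabeled tokens. The baselines use each model's own subword tokenizer, and the summary does not specify the labeling scheme, so the `spans_to_token_labels` helper and the example labels are illustrative only.

```python
import re


def spans_to_token_labels(text, spans, outside="O"):
    """Label each whitespace-delimited token with the label of the span covering it."""
    labeled_tokens = []
    for match in re.finditer(r"\S+", text):
        label = outside
        for start, end, span_label in spans:
            # A token takes a span's label if the two character ranges overlap.
            if match.start() < end and match.end() > start:
                label = span_label
                break
        labeled_tokens.append((match.group(), label))
    return labeled_tokens


text = "Invoice total: $1,250.00 due 2024-04-01"
# (start, end, label) with end exclusive; labels are hypothetical.
spans = [(15, 24, "total"), (29, 39, "due_date")]

labeled = spans_to_token_labels(text, spans)
```

A classifier head then predicts one label per token, and contiguous tokens with the same label are merged back into spans at inference time.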

The models were trained using two different codebases: Hugging Face Transformers for RoBERTa, DeBERTa, Longformer, and LayoutLM, and the Finetune Library for XDoc and a comparison RoBERTa run. The authors conducted a Hyperband Bayesian hyperparameter search and selected the model with the highest validation F1 score.

For training on long documents with sparse labels, the authors chunked the documents to match the context size of the model. They also utilized undersampling of chunks without labels to improve recall and stabilize the loss. The Finetune Library includes an "Auto Negative Sampling" feature for hard-negative mining.
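The chunking and undersampling described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes non-overlapping fixed-size chunks and interprets the empty-chunk ratio as the maximum number of unlabeled chunks kept per labeled chunk. The function names and the `max_empty_ratio` parameter are hypothetical.

```python
import random


def chunk_tokens(tokens, labels, chunk_size):
    """Split parallel token and label lists into consecutive fixed-size chunks."""
    return [
        (tokens[i:i + chunk_size], labels[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def undersample_empty_chunks(chunks, max_empty_ratio, seed=0, outside="O"):
    """Keep every labeled chunk and at most max_empty_ratio times as many empty ones."""
    labeled = [c for c in chunks if any(label != outside for label in c[1])]
    empty = [c for c in chunks if all(label == outside for label in c[1])]
    max_empty = int(max_empty_ratio * len(labeled))
    rng = random.Random(seed)  # seeded so the sample is reproducible
    kept_empty = rng.sample(empty, min(max_empty, len(empty)))
    return labeled + kept_empty


# A toy 40-token document with only two labeled tokens.
tokens = [f"tok{i}" for i in range(40)]
labels = ["O"] * 40
labels[3] = "date"
labels[27] = "price"

chunks = chunk_tokens(tokens, labels, chunk_size=4)
sampled = undersample_empty_chunks(chunks, max_empty_ratio=1.0)
```

In this toy document only 2 of the 10 chunks contain a label, so a ratio of 1.0 keeps those 2 plus 2 randomly chosen empty chunks. Random sampling of this kind differs from hard-negative mining, which selects the empty chunks the model currently gets wrong.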

The authors provide details on the sweep parameters used for each model and library in Table 7. Table 8 summarizes the key characteristics of the baseline models.

The authors estimate the aggregate equivalent CO2 emissions from running the baselines to be 766 kg, but believe this impact is justified by producing reliable baselines for future work. The full code and scripts will be shared shortly.

Analysis

The paper provides a brief analysis of the baseline results on the RealKIE dataset, highlighting the challenges outlined in Section 1. The analysis focuses on three key aspects:

Complex Layout and Text Serialization Issues:
- The datasets have layout components that are likely important for the task.
- While models like LayoutLM and XDoc use 2D positional features to improve performance on layout-rich documents, they underperform text-only models for most datasets, except Charities.
- The paper invites further work to determine whether the datasets do not require positional features or whether the current base models are unable to exploit this property.
Sparse Annotations and Class Imbalance:
- The RealKIE datasets exhibit two primary modes of imbalance: label sparsity and class imbalance.
- The paper's baselines include approaches like class weighting, auto-negative-sampling, and random-negative-sampling to address these issues.
- The best-performing RoBERTa models use class weights and auto-negative-sampling, with the exception of the FCC Invoices dataset, which has no chunks without labels.
- Finetune RoBERTa outperforms Hugging Face RoBERTa on 4 out of 5 datasets, likely due to the handling of imbalances.
Context Length:
- The average length of the datasets exceeds the 512-token context length used by most baseline models.
- Comparing RoBERTa-base and Longformer-base, the paper finds that Longformer, with its extended context length, outperforms RoBERTa in 4 out of 5 cases, suggesting that context length is advantageous for these datasets.
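As an illustration of the class weighting mentioned above, the sketch below computes inverse-frequency weights from token labels. The summary does not state which weighting formula the baselines use, so this formula (total count divided by the number of classes times the class count, the same heuristic scikit-learn calls "balanced") is an assumption, and the helper name and labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = sum(counts.values())
    num_classes = len(counts)
    return {
        label: total / (num_classes * count)
        for label, count in counts.items()
    }


# A toy label distribution dominated by unlabeled ("O") tokens.
labels = ["O"] * 90 + ["date"] * 8 + ["price"] * 2
weights = inverse_frequency_weights(labels)
```

Rare classes receive larger weights, so errors on them contribute more to a weighted loss than errors on the dominant unlabeled class.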

In summary, the paper highlights the challenges of complex layout, sparse annotations, class imbalance, and limited context length in the RealKIE datasets, and invites future work to address these issues.

Conclusions

The paper introduced RealKIE, a new benchmark of document datasets that reflect the challenges knowledge workers face when automating data extraction. The datasets exhibit the following difficulties:

Poor document quality leading to OCR artifacts and poor text serialization
Sparse annotations within long documents causing class imbalance issues
Complex tabular layout that must be considered to discriminate between similar labels
Varied data types to be extracted, from simple dates and prices to long-form clauses

The paper's baselines indicate that existing methods can effectively handle long context, class imbalance, and label sparsity. However, the models struggle to leverage layout information. The authors state that models or frameworks that can improve upon this benchmark by being robust to these common difficulties would represent a major step forward in real-world information extraction technologies. The paper presents RealKIE as a reusable test bed for such advancements.

Acknowledgments

The authors acknowledge the substantial effort expended by the labeling team in producing high-quality labels for these difficult datasets. The paper also tabulates the best-performing parameters for each dataset and model, covering Longformer Base, LayoutLM V3 Base, DeBERTa V3 Base, RoBERTa Base (Finetune and Hugging Face), and XDoc Base. The columns listed include dataset, F1 score, Auto Negative Sampling, Max Empty Chunk Ratio, Learning Rate, Batch Size, Num Epochs, Class Weights, LR Warmup, Collapse Whitespace, Max Grad Norm, L2 Regularization, Gradient Accumulation Steps, and LR Schedule.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

RealKIE: Five Novel Datasets for Enterprise Key Information Extraction

Benjamin Townsend, Madison May, Christopher Wells

We introduce RealKIE, a benchmark of five challenging datasets aimed at advancing key information extraction methods, with an emphasis on enterprise applications. The datasets include a diverse range of documents including SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents unique challenges: poor text serialization, sparse annotations in long documents, and complex tabular layouts. These datasets provide a realistic testing ground for key information extraction tasks like investment analysis and legal data processing. In addition to presenting these datasets, we offer an in-depth description of the annotation process, document processing techniques, and baseline modeling approaches. This contribution facilitates the development of NLP models capable of handling practical challenges and supports further research into information extraction technologies applicable to industry-specific problems. The annotated data and OCR outputs are available to download at https://indicodatasolutions.github.io/RealKIE/. Code to reproduce the baselines will be available shortly.

4/1/2024

KVP10k: A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents

Oshri Naparstek, Roi Pony, Inbar Shapira, Foad Abo Dahood, Ophir Azulai, Yevgeny Yaroker, Nadav Rubinstein, Maksym Lysak, Peter Staar, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, Elad Amrani, Idan Friedman, Orit Prince, Yevgeny Burshtein, Adi Raz Goldfarb, Udi Barzelay

In recent years, the challenge of extracting information from business documents has emerged as a critical task, finding applications across numerous domains. This effort has attracted substantial interest from both industry and academy, highlighting its significance in the current technological landscape. Most datasets in this area are primarily focused on Key Information Extraction (KIE), where the extraction process revolves around extracting information using a specific, predefined set of keys. Unlike most existing datasets and benchmarks, our focus is on discovering key-value pairs (KVPs) without relying on predefined keys, navigating through an array of diverse templates and complex layouts. This task presents unique challenges, primarily due to the absence of comprehensive datasets and benchmarks tailored for non-predetermined KVP extraction. To address this gap, we introduce KVP10k , a new dataset and benchmark specifically designed for KVP extraction. The dataset contains 10707 richly annotated images. In our benchmark, we also introduce a new challenging task that combines elements of KIE as well as KVP in a single task. KVP10k sets itself apart with its extensive diversity in data and richly detailed annotations, paving the way for advancements in the field of information extraction from complex business documents.

5/2/2024

KET-QA: A Dataset for Knowledge Enhanced Table Question Answering

Mengkang Hu, Haoyu Dong, Ping Luo, Shi Han, Dongmei Zhang

Due to the concise and structured nature of tables, the knowledge contained therein may be incomplete or missing, posing a significant challenge for table question answering (TableQA) and data analysis systems. Most existing datasets either fail to address the issue of external knowledge in TableQA or only utilize unstructured text as supplementary information for tables. In this paper, we propose to use a knowledge base (KB) as the external knowledge source for TableQA and construct a dataset KET-QA with fine-grained gold evidence annotation. Each table in the dataset corresponds to a sub-graph of the entire KB, and every question requires the integration of information from both the table and the sub-graph to be answered. To extract pertinent information from the vast knowledge sub-graph and apply it to TableQA, we design a retriever-reasoner structured pipeline model. Experimental results demonstrate that our model consistently achieves remarkable relative performance improvements ranging from 1.9 to 6.5 times and absolute improvements of 11.66% to 44.64% on EM scores across three distinct settings (fine-tuning, zero-shot, and few-shot), in comparison with solely relying on table information in the traditional TableQA manner. However, even the best model achieves a 60.23% EM score, which still lags behind the human-level performance, highlighting the challenging nature of KET-QA for the question-answering community. We also provide a human evaluation of error cases to analyze further the aspects in which the model can be improved. Project page: https://ketqa.github.io/.

5/15/2024

BuDDIE: A Business Document Dataset for Multi-task Information Extraction

Ran Zmigrod, Dongsheng Wang, Mathieu Sibue, Yulong Pei, Petr Babkin, Ivan Brugere, Xiaomo Liu, Nacho Navarro, Antony Papadimitriou, William Watson, Zhiqiang Ma, Armineh Nourbakhsh, Sameena Shah

The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in a multi-modal domain. Several datasets exist for research on specific tasks of VRDU such as document classification (DC), key entity extraction (KEE), entity linking, visual question answering (VQA), inter alia. These datasets cover documents like invoices and receipts with sparse annotations such that they support one or two co-related tasks (e.g., entity extraction and entity linking). Unfortunately, only focusing on a single specific of documents or task is not representative of how documents often need to be processed in the wild - where variety in style and requirements is expected. In this paper, we introduce BuDDIE (Business Document Dataset for Information Extraction), the first multi-task dataset of 1,665 real-world business documents that contains rich and dense annotations for DC, KEE, and VQA. Our dataset consists of publicly available business entity documents from US state government websites. The documents are structured and vary in their style and layout across states and types (e.g., forms, certificates, reports, etc.). We provide data variety and quality metrics for BuDDIE as well as a series of baselines for each task. Our baselines cover traditional textual, multi-modal, and large language model approaches to VRDU.

4/8/2024