Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

Read original: arXiv:2406.19102 - Published 6/28/2024 by Lokesh Mishra, Sohayl Dhibi, Yusik Kim, Cesar Berrospi Ramis, Shubham Gupta, Michele Dolfi, Peter Staar

Overview

This paper explores extracting quantitative information, such as key performance indicators (KPIs) for environmental, social, and governance (ESG) reporting, from tables using language models.
The authors propose "Statements", a domain-agnostic data structure for quantitative facts and related information, and frame translating tables into statements as a supervised deep-learning information extraction task.
They introduce SemTabNet, a dataset of over 100K annotated tables, and report that their best T5-based model generates statements that are 82% similar to the ground truth, compared with 21% for the baseline.

Plain English Explanation

The paper discusses a new way to automatically extract important information from tables using AI language models. Tables are a common way to present data, but their structure and content vary so much that they are hard for computers to read reliably. The researchers define a uniform format called "Statements" for the facts a table contains, and train language models (AI systems that learn from large amounts of text) to translate tables into that format. Their main use case is key performance indicators (KPIs) related to environmental, social, and governance (ESG) factors.

ESG reporting is an increasingly important area, as investors and the public want to understand the non-financial impacts of companies. However, manually extracting relevant ESG data from company reports and filings can be time-consuming and error-prone. The Statements system aims to automate this process by using AI to quickly parse through tables of information and identify the most salient metrics and insights.

The researchers trained their models on SemTabNet, a dataset of over 100K annotated tables, and found that the best model produced statements 82% similar to the ground truth, against 21% for the baseline. They then applied it to over 2700 tables from ESG reports. This suggests that language models can be an effective way to unlock the data held in tables, with applications in areas like financial analysis and sustainability reporting.

Technical Explanation

The core of the approach is a family of T5-based Statement Extraction Models trained on SemTabNet, a dataset of over 100K annotated tables. Training on these annotations lets a model learn the patterns and structures of tabular data, as well as how to map that information into concise, uniform "statements."
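
To make the idea of a "statement" concrete, here is a minimal Python sketch of one possible record layout. The field names (`subject`, `prop`, `value`, `unit`) and the helper `table_to_statements` are assumptions made for illustration and are not taken from the paper. The helper is a rule-based stand-in, not a trained model; it only shows how one table cell can become a self-contained quantitative fact.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Statement:
    """One quantitative fact (hypothetical layout, not the paper's schema)."""
    subject: str          # row label, e.g. "Scope 1 emissions"
    prop: str             # column header, e.g. "2022"
    value: float          # numeric cell content
    unit: Optional[str]   # e.g. "tCO2e"; None when the table gives no unit


def table_to_statements(header: List[str], rows: List[List[str]],
                        unit: Optional[str] = None) -> List[Statement]:
    """Turn every numeric cell of a simple row/column table into a Statement.

    Assumes the first column holds row labels and the first row holds
    column headers; real ESG tables are far less regular than this.
    """
    statements = []
    for row in rows:
        label, cells = row[0], row[1:]
        for column, cell in zip(header[1:], cells):
            try:
                number = float(cell.replace(",", ""))
            except ValueError:
                continue  # skip non-numeric cells such as "n/a"
            statements.append(Statement(label, column, number, unit))
    return statements


header = ["KPI", "2021", "2022"]
rows = [["Scope 1 emissions", "1,200", "1,050"],
        ["Scope 2 emissions", "n/a", "300"]]
facts = table_to_statements(header, rows, unit="tCO2e")
```

Because every record has the same fields regardless of the source table's layout, records from many tables can be collected into one list and filtered or aggregated together.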

During inference, the Statements model takes a table as input and generates a set of output statements that summarize the key facts and metrics contained in the table. This is done through a multi-step process:

1. Table Understanding: The LLM analyzes the table structure, headers, and content to build an internal representation of the information.
2. Statement Generation: The model then generates natural language statements that capture the most salient data points and insights from the table.
3. Post-processing: Finally, the generated statements are filtered and refined to ensure they are accurate, concise, and directly relevant to the table contents.
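
The three steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary keys, and `fake_model` (a placeholder that returns fixed output instead of calling a trained model) are all hypothetical.

```python
from typing import Callable, Dict, List

Table = List[List[str]]
StatementDict = Dict[str, str]


def linearize(table: Table) -> str:
    """Step 1: flatten the table into text a sequence model can read."""
    return "\n".join(" | ".join(row) for row in table)


def postprocess(table: Table, candidates: List[StatementDict]) -> List[StatementDict]:
    """Step 3: drop duplicates and statements whose value is not in the table."""
    cells = {cell for row in table for cell in row}
    kept, seen = [], set()
    for cand in candidates:
        key = (cand["subject"], cand["property"], cand["value"])
        if cand["value"] in cells and key not in seen:
            seen.add(key)
            kept.append(cand)
    return kept


def extract(table: Table,
            generate: Callable[[str], List[StatementDict]]) -> List[StatementDict]:
    """Run all three steps; `generate` stands in for the trained model (step 2)."""
    return postprocess(table, generate(linearize(table)))


def fake_model(text: str) -> List[StatementDict]:
    """Placeholder generator: fixed output with one duplicate and one bad value."""
    return [
        {"subject": "Waste recycled", "property": "2022", "value": "64"},
        {"subject": "Waste recycled", "property": "2022", "value": "64"},
        {"subject": "Waste recycled", "property": "2021", "value": "99"},
    ]


table = [["KPI", "2021", "2022"], ["Waste recycled", "58", "64"]]
result = extract(table, fake_model)
```

Passing the generator in as a callable keeps the pre- and post-processing testable without loading any model; a real model wrapper could be substituted for `fake_model` without changing the other functions.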

The researchers evaluate a family of T5-based models and compare them to a baseline. The best model generates statements that are 82% similar to the ground truth, compared with 21% for the baseline. They also apply the model to over 2700 tables from ESG reports, where the homogeneous form of statements supports exploratory data analysis across large report collections.
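
To give a feel for what scoring generated statements against ground truth involves, the sketch below computes a Jaccard overlap between two sets of facts. This is a simple stand-in chosen for illustration; the metric used in the paper is not described in this summary, and the tuple layout and the name `jaccard_similarity` are assumptions.

```python
from typing import Iterable, Tuple

Fact = Tuple[str, str, str]  # (subject, property, value), a hypothetical layout


def jaccard_similarity(predicted: Iterable[Fact], reference: Iterable[Fact]) -> float:
    """Share of distinct facts that appear in both sets (1.0 means identical)."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # two empty outputs are treated as a perfect match
    return len(pred & ref) / len(pred | ref)


reference = [("Scope 1", "2022", "1050"), ("Scope 2", "2022", "300")]
predicted = [("Scope 1", "2022", "1050"), ("Scope 2", "2022", "310")]
score = jaccard_similarity(predicted, reference)  # one shared fact out of three distinct
```

An exact-match overlap like this gives no partial credit for a statement that is nearly right, which is one reason a real evaluation might prefer a more graded measure.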

Critical Analysis

The Statements approach is a notable application of language models to table understanding and information extraction. By learning from annotated tables rather than hand-written rules, the system can go beyond traditional rule-based or pattern-matching methods and handle more of the variability found in real tables.

However, the paper also acknowledges some limitations of the current Statements system. For example, the model may struggle with tables that have complex structures or unconventional layouts, and the quality of the generated statements can be sensitive to the specific fine-tuning data and hyperparameters used.

Additionally, while the Statements system shows strong performance on the evaluated datasets, it would be valuable to see how it generalizes to a wider range of real-world table types and use cases. Further research could also explore ways to make the system more robust, transparent, and controllable, such as by incorporating user feedback or providing explanations for the generated statements.

Overall, the Statements work is an important step forward in using language models for information extraction from tables, with promising applications in areas like ESG KPI extraction, table-based question answering, and table understanding more broadly. As understanding of language model capabilities and limitations improves, techniques like Statements will likely play an increasingly important role in unlocking the value of structured data.

Conclusion

This paper introduces Statements, a domain-agnostic data structure for quantitative facts, together with models that translate tables into statements, with a particular focus on environmental, social, and governance (ESG) reporting. The approach uses trained language models to analyze table contents and produce concise, uniform statements that capture the key data points and metrics.

The researchers demonstrate the effectiveness of the approach on SemTabNet, where the best model reaches 82% similarity to the ground truth against 21% for the baseline, and by applying it to over 2700 tables from ESG reports. This work is an important advance in using AI to unlock the value of tabular data, with applications in fields like financial analysis, sustainability reporting, and question answering.

While the Statements system has some limitations, this research highlights the exciting potential of large language models for tasks beyond just natural language processing. As the field continues to evolve, techniques like Statements will likely play an increasingly important role in bridging the gap between unstructured and structured data, helping organizations and individuals extract greater insights from the wealth of information contained in tables and other formats.

Related Papers

Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

Lokesh Mishra, Sohayl Dhibi, Yusik Kim, Cesar Berrospi Ramis, Shubham Gupta, Michele Dolfi, Peter Staar

Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.

6/28/2024

Schema-Driven Information Extraction from Heterogeneous Tables

Fan Bai, Junmo Kang, Gabriel Stanovsky, Dayne Freitag, Mark Dredze, Alan Ritter

In this paper, we explore the question of whether large language models can support cost-efficient information extraction from tables. We introduce schema-driven information extraction, a new task that transforms tabular data into structured records following a human-authored schema. To assess various LLMs' capabilities on this task, we present a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. We use this collection of annotated tables to evaluate the ability of open-source and API-based language models to extract information from tables covering diverse domains and data formats. Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels, achieving F1 scores ranging from 74.2 to 96.1, while maintaining cost efficiency. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to model success and validate the practicality of distilling compact models to reduce API reliance.

7/24/2024

Uncovering Limitations of Large Language Models in Information Seeking from Tables

Chaoxu Pang, Yixuan Cao, Chunhao Yang, Ping Luo

Tables are recognized for their high information density and widespread usage, serving as essential sources of information. Seeking information from tables (TIS) is a crucial capability for Large Language Models (LLMs), serving as the foundation of knowledge-based Q&A systems. However, this field presently suffers from an absence of thorough and reliable evaluation. This paper introduces a more reliable benchmark for Table Information Seeking (TabIS). To avoid the unreliable evaluation caused by text similarity-based metrics, TabIS adopts a single-choice question format (with two options per question) instead of a text generation format. We establish an effective pipeline for generating options, ensuring their difficulty and quality. Experiments conducted on 12 LLMs reveal that while the performance of GPT-4-turbo is marginally satisfactory, both other proprietary and open-source models perform inadequately. Further analysis shows that LLMs exhibit a poor understanding of table structures, and struggle to balance between TIS performance and robustness against pseudo-relevant tables (common in retrieval-augmented systems). These findings uncover the limitations and potential challenges of LLMs in seeking information from tables. We release our data and code to facilitate further research in this field.

6/7/2024

TabSQLify: Enhancing Reasoning Capabilities of LLMs Through Table Decomposition

Md Mahadi Hasan Nahid, Davood Rafiei

Table reasoning is a challenging task that requires understanding both natural language questions and structured tabular data. Large language models (LLMs) have shown impressive capabilities in natural language understanding and generation, but they often struggle with large tables due to their limited input length. In this paper, we propose TabSQLify, a novel method that leverages text-to-SQL generation to decompose tables into smaller and relevant sub-tables, containing only essential information for answering questions or verifying statements, before performing the reasoning task. In our comprehensive evaluation on four challenging datasets, our approach demonstrates comparable or superior performance compared to prevailing methods reliant on full tables as input. Moreover, our method can reduce the input context length significantly, making it more scalable and efficient for large-scale table reasoning applications. Our method performs remarkably well on the WikiTQ benchmark, achieving an accuracy of 64.7%. Additionally, on the TabFact benchmark, it achieves a high accuracy of 79.5%. These results surpass other LLM-based baseline models on gpt-3.5-turbo (chatgpt). TabSQLify can reduce the table size significantly alleviating the computational load on LLMs when handling large tables without compromising performance.

4/17/2024