TANQ: An open domain dataset of table answered questions

2405.07765

Published 5/14/2024 by Mubashara Akhtar, Chenxi Pang, Andreea Marzoca, Yasemin Altun, Julian Martin Eisenschlos

Abstract

Language models, potentially augmented with tool usage such as retrieval, are becoming the go-to means of answering questions. Understanding and answering questions in real-world settings often requires retrieving information from different sources, processing and aggregating data to extract insights, and presenting complex findings in the form of structured artifacts such as novel tables, charts, or infographics. In this paper, we introduce TANQ, the first open domain question answering dataset where the answers require building tables from information across multiple sources. We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed book setups. Our best-performing baseline, GPT4, reaches an overall F1 score of 29.1, lagging behind human performance by 19.7 points. We analyse baselines' performance across different dataset attributes such as different skills required for this task, including multi-hop reasoning, math operations, and unit conversions. We further discuss common failures in model-generated answers, suggesting that TANQ is a complex task with many challenges ahead.

Overview

This paper introduces TANQ, the first open-domain question answering dataset where the answers require building tables from information across multiple sources.
The authors provide full source attribution for every cell in the resulting tables and benchmark state-of-the-art language models in open, oracle, and closed book setups.
The best-performing baseline, GPT4, reaches an overall F1 score of 29.1, lagging behind human performance by 19.7 points.
The paper analyzes the baselines' performance across different dataset attributes, including multi-hop reasoning, math operations, and unit conversions.
The authors discuss common failures in model-generated answers, suggesting that TANQ is a complex task with many challenges ahead.

Plain English Explanation

In this paper, the researchers introduce TANQ, an open-domain dataset of table-answered questions. It evaluates whether language models can answer questions that require retrieving information from multiple sources, processing and combining that data, and presenting the results as a structured table.

Unlike traditional question-answering tasks, where the model is given a single source of information to draw from, TANQ requires the model to gather relevant details from various sources, perform any necessary calculations or unit conversions, and then organize the findings into a coherent table. This more closely mimics the real-world challenges of answering complex queries that may not be easily answered by a single document or website.

The researchers tested several state-of-the-art language models, including GPT4, on the TANQ dataset. The best model, GPT4, reached an overall F1 score of 29.1, which is 19.7 points below human performance. This gap suggests that building table-based answers from multiple sources remains a challenging task for current AI systems.

The paper dives into the specific skills required for this task, such as multi-hop reasoning (drawing insights from multiple steps of information gathering), math operations, and unit conversions. By analyzing the models' strengths and weaknesses in these areas, the researchers hope to identify key areas for improvement in the development of more capable question-answering systems.

Overall, the TANQ dataset and the insights from this study represent an important step forward in benchmarking the real-world capabilities of language models and pushing the boundaries of what they can achieve in terms of answering complex, multi-faceted questions.

Technical Explanation

In this paper, the authors introduce TANQ, the first open-domain question answering dataset where the answers require building tables from information across multiple sources. The dataset provides full source attribution for every cell in the resulting tables, allowing for a more comprehensive evaluation of model performance.
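To make the idea of per-cell source attribution concrete, here is a minimal sketch of a table whose cells each carry their own list of sources. The class names, field names, and example values (`Cell`, `AnswerTable`, `doc_1`, and so on) are hypothetical and chosen for illustration; they are not the schema used by the TANQ release.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cell:
    """One table cell plus the sources that support it (hypothetical schema)."""
    value: str
    sources: tuple[str, ...] = ()  # e.g. document titles or identifiers


@dataclass
class AnswerTable:
    """An answer table in which every cell carries its own attribution."""
    columns: list[str]
    rows: list[list[Cell]] = field(default_factory=list)

    def add_row(self, cells: list[Cell]) -> None:
        if len(cells) != len(self.columns):
            raise ValueError("row length must match the number of columns")
        self.rows.append(cells)

    def unattributed(self) -> list[tuple[int, str]]:
        """Return (row index, column name) for every cell with no source."""
        return [
            (i, self.columns[j])
            for i, row in enumerate(self.rows)
            for j, cell in enumerate(row)
            if not cell.sources
        ]


# Illustrative data only; not taken from the dataset.
table = AnswerTable(columns=["Entity", "Height (m)"])
table.add_row([Cell("Tower A", ("doc_1",)), Cell("120", ("doc_2",))])
table.add_row([Cell("Tower B", ("doc_3",)), Cell("95")])  # second cell lacks a source
missing = table.unattributed()
print(missing)
```

Storing sources on the cell rather than on the row or table is what allows an evaluation to check where each individual value came from, since different cells in one row may come from different documents.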

The authors benchmark state-of-the-art language models, including GPT4, in open, oracle, and closed book setups. In the open setting, models have access to external information sources, while in the oracle setting, they are given the relevant sources upfront. The closed book setup tests the models' ability to answer questions without any external information.

The best-performing baseline, GPT4, achieves an overall F1 score of 29.1, which is 19.7 points lower than human performance on the same task. The authors analyze the models' performance across different dataset attributes, such as multi-hop reasoning, math operations, and unit conversions, to identify the specific challenges posed by this task.
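For intuition about what an F1 score over tables measures, the sketch below computes a simplified exact-match F1 over (row key, column, value) triples. This is an assumption made for illustration and may differ from the paper's actual metric, which this summary does not specify; the example table values are invented. The last lines derive the implied human score from the two numbers reported above.

```python
Triple = tuple[str, str, str]  # (row key, column name, cell value)


def cell_f1(predicted: set[Triple], gold: set[Triple]) -> float:
    """Exact-match F1 over table cells; a simplification, not the paper's metric."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only.
gold = {
    ("Tower A", "height_m", "120"),
    ("Tower A", "opened", "1985"),
    ("Tower B", "height_m", "95"),
    ("Tower B", "opened", "1990"),
}
pred = {
    ("Tower A", "height_m", "120"),
    ("Tower B", "height_m", "90"),   # wrong value
    ("Tower B", "opened", "1990"),
}

score = cell_f1(pred, gold)  # precision 2/3, recall 2/4
print(round(score, 4))

# Human score implied by the reported numbers: GPT4 F1 plus the reported gap.
gpt4_f1 = 29.1
gap_to_human = 19.7
human_f1 = round(gpt4_f1 + gap_to_human, 1)
print(human_f1)
```

A missing cell lowers recall and a wrong or extra cell lowers precision, so a model is penalized both for omitting rows and for filling in incorrect values.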

The paper also discusses common failures in model-generated answers, suggesting that TANQ is a complex task with many hurdles that current language models struggle to overcome. Related question-answering benchmarks such as FREB-TQA, TableVQA-Bench, UQA, and KazQAD address adjacent problems.

Critical Analysis

While the TANQ dataset and the insights from this study represent an important step forward in benchmarking the capabilities of language models, the authors acknowledge several limitations and areas for further research.

One key challenge is the inherent complexity of the task, which requires models to not only retrieve relevant information from multiple sources but also perform various reasoning and calculation steps to arrive at the final table-based answer. This places a significant cognitive load on the models and may be beyond the current capabilities of even the most advanced language models.

Additionally, the authors note that the performance of the models could be heavily influenced by the quality and coverage of the information sources provided. If the relevant data is not present or is incomplete in the given sources, the models may struggle to construct a satisfactory answer, even with strong language understanding and reasoning abilities.

Further research is needed to explore techniques for better integrating external information retrieval, data processing, and tabular output generation within a single end-to-end framework. Work on related benchmarks such as FREB-TQA, TableVQA-Bench, UQA, and KazQAD may help inform the development of more capable question-answering systems that can handle the complexities of real-world scenarios.

Conclusion

The introduction of the TANQ dataset and the benchmarking of state-of-the-art language models on this task represent a significant advancement in the field of question-answering. The ability to retrieve information from multiple sources, process and aggregate data, and present findings in a structured tabular format is a critical skill for AI systems to possess in order to effectively assist humans in real-world problem-solving.

While the current performance of the tested models, including GPT4, falls short of human-level capabilities, the insights gained from this study can help guide future research and development efforts. By understanding the specific challenges posed by TANQ, such as multi-hop reasoning, math operations, and unit conversions, researchers can work to address these limitations and push the boundaries of what is possible in the realm of open-domain question-answering.

As language models and other AI systems continue to evolve, the TANQ dataset and the lessons learned from this paper will play an important role in ensuring that these technologies are able to truly understand and assist humans in navigating the complex and multifaceted challenges of the real world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TIGQA: An Expert Annotated Question Answering Dataset in Tigrinya

Hailay Teklehaymanot, Dren Fazlija, Niloy Ganguly, Gourab K. Patro, Wolfgang Nejdl

The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert annotated educational dataset consisting of 2.68K question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pretrained models. The notable disparities between human performance and best model performance underscore the potential for further enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in the Tigrinya MRC.

4/29/2024

cs.CL

KET-QA: A Dataset for Knowledge Enhanced Table Question Answering

Mengkang Hu, Haoyu Dong, Ping Luo, Shi Han, Dongmei Zhang

Due to the concise and structured nature of tables, the knowledge contained therein may be incomplete or missing, posing a significant challenge for table question answering (TableQA) and data analysis systems. Most existing datasets either fail to address the issue of external knowledge in TableQA or only utilize unstructured text as supplementary information for tables. In this paper, we propose to use a knowledge base (KB) as the external knowledge source for TableQA and construct a dataset KET-QA with fine-grained gold evidence annotation. Each table in the dataset corresponds to a sub-graph of the entire KB, and every question requires the integration of information from both the table and the sub-graph to be answered. To extract pertinent information from the vast knowledge sub-graph and apply it to TableQA, we design a retriever-reasoner structured pipeline model. Experimental results demonstrate that our model consistently achieves remarkable relative performance improvements ranging from 1.9 to 6.5 times and absolute improvements of 11.66% to 44.64% on EM scores across three distinct settings (fine-tuning, zero-shot, and few-shot), in comparison with solely relying on table information in the traditional TableQA manner. However, even the best model achieves a 60.23% EM score, which still lags behind the human-level performance, highlighting the challenging nature of KET-QA for the question-answering community. We also provide a human evaluation of error cases to analyze further the aspects in which the model can be improved. Project page: https://ketqa.github.io/.

5/15/2024

cs.CL

FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering

Wei Zhou, Mohsen Mesgar, Heike Adel, Annemarie Friedrich

Table Question Answering (TQA) aims at composing an answer to a question based on tabular data. While prior research has shown that TQA models lack robustness, understanding the underlying cause and nature of this issue remains predominantly unclear, posing a significant obstacle to the development of robust TQA systems. In this paper, we formalize three major desiderata for a fine-grained evaluation of robustness of TQA systems. They should (i) answer questions regardless of alterations in table structure, (ii) base their responses on the content of relevant cells rather than on biases, and (iii) demonstrate robust numerical reasoning capabilities. To investigate these aspects, we create and publish a novel TQA evaluation benchmark in English. Our extensive experimental analysis reveals that none of the examined state-of-the-art TQA systems consistently excels in these three aspects. Our benchmark is a crucial instrument for monitoring the behavior of TQA systems and paves the way for the development of robust TQA systems. We release our benchmark publicly.

4/30/2024

cs.CL

TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

Yoonsik Kim, Moonbin Yim, Ka Yeon Song

In this paper, we establish a benchmark for table visual question answering, referred to as the TableVQA-Bench, derived from pre-existing table question-answering (QA) and table structure recognition datasets. It is important to note that existing datasets have not incorporated images or QA pairs, which are two crucial components of TableVQA. As such, the primary objective of this paper is to obtain these necessary components. Specifically, images are sourced either through the application of a stylesheet or by employing the proposed table rendering system. QA pairs are generated by exploiting the large language model (LLM) where the input is a text-formatted table. Ultimately, the completed TableVQA-Bench comprises 1,500 QA pairs. We comprehensively compare the performance of various multi-modal large language models (MLLMs) on TableVQA-Bench. GPT-4V achieves the highest accuracy among commercial and open-sourced MLLMs from our experiments. Moreover, we discover that the number of vision queries plays a significant role in TableVQA performance. To further analyze the capabilities of MLLMs in comparison to their LLM backbones, we investigate by presenting image-formatted tables to MLLMs and text-formatted tables to LLMs, respectively. Our findings suggest that processing visual inputs is more challenging than text inputs, as evidenced by the lower performance of MLLMs, despite generally requiring higher computational costs than LLMs. The proposed TableVQA-Bench and evaluation codes are available at https://github.com/naver-ai/tablevqabench.

5/1/2024

cs.CV cs.AI