ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

Read original: arXiv:2403.17859 - Published 5/13/2024 by Bhawna Piryani, Jamshid Mozafari, Adam Jatowt

Overview

• This paper introduces ChroniclingAmericaQA, a large-scale question answering dataset based on historical American newspaper pages.

• The dataset is designed to support research in machine reading comprehension and question answering on real-world historical documents.

Plain English Explanation

ChroniclingAmericaQA is a new dataset that contains questions and answers based on the text from historical American newspaper pages. The goal of this dataset is to help advance research in machine reading and question answering systems, which aim to build AI models that can understand and respond to questions about written text.

Most existing large-scale question answering datasets are built from contemporary sources such as Wikipedia or the Web, but the team behind this new dataset wanted to create something more challenging by using historical newspaper content. Newspapers from the past can have more complex language, outdated terminology, and references to events that may not be familiar to modern readers. Answering questions about this kind of historical text requires more sophisticated natural language understanding.

By giving researchers this new dataset of questions and answers tied to digitized newspaper content, the authors hope to spur the development of more advanced machine reading and question answering models that can handle real-world information sources, not just clean, contemporary web text. This could have important applications for using AI to explore and extract insights from large digital archives of historical documents.

Technical Explanation

The ChroniclingAmericaQA dataset was constructed using newspaper pages from the Chronicling America digital collection, which contains over 15 million pages of historic American newspapers. The team developed a pipeline to automatically generate question-answer pairs based on the text and metadata of these newspaper pages.

The process involved several steps:

• Selecting relevant newspaper pages based on criteria like publication date, content, and language.

• Applying optical character recognition (OCR) to extract the text from the digitized newspaper pages.

• Generating questions about the content of the newspaper articles using techniques like cloze-style masks and paraphrasing.

• Annotating the correct answers to the generated questions based on the newspaper text.
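To make the cloze-style step more concrete, here is a minimal sketch of how a sentence can be turned into a fill-in-the-blank question by masking an answer span. This is an illustration only, not the authors' pipeline: the function name `make_cloze`, the mask string, and the example sentence are all invented for this sketch, and the answer span is assumed to be known in advance.

```python
import re
from typing import Optional


def make_cloze(sentence: str, answer: str, mask: str = "_____") -> Optional[dict]:
    """Mask the first occurrence of `answer` in `sentence` to form a cloze question.

    Hypothetical helper for illustration; returns None when the answer
    span does not occur in the sentence.
    """
    match = re.search(re.escape(answer), sentence)
    if match is None:
        return None
    # Replace only the matched span, keeping the rest of the sentence intact.
    question = sentence[:match.start()] + mask + sentence[match.end():]
    return {
        "question": question,
        "answer": answer,
        "answer_start": match.start(),  # character offset of the answer in the source sentence
    }


# Invented example sentence, not taken from the dataset.
example = make_cloze("The convention met in Chicago on Tuesday.", "Chicago")
```

Storing the character offset alongside the answer keeps each generated pair traceable to its position in the source text, which is useful when the text is noisy.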

According to the paper's abstract, the final dataset contains 487K question-answer pairs built from a subset of the Chronicling America collection spanning 120 years. The questions cover a wide range of topics, including news events of the period, historical information, sports, and more. The team evaluated the dataset using standard machine reading comprehension metrics and found it significantly more challenging than existing QA benchmarks.
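The summary does not name the metrics used. As an assumption, the sketch below shows the two that are most common in extractive reading comprehension, exact match and token-level F1, computed in the SQuAD style. The function names and the normalization choices (lowercasing, dropping punctuation and English articles) are conventions adopted for this illustration and may differ from the paper's evaluation setup.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token F1 gives partial credit when a prediction overlaps the gold answer without matching it exactly, which matters when answers are extracted from text with OCR errors.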

Critical Analysis

The ChroniclingAmericaQA dataset represents an important step forward in creating more realistic and challenging test beds for machine reading comprehension and question answering systems. By focusing on historical newspaper content, the dataset taps into a rich source of real-world information that has not been well-explored in existing QA datasets.

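However, the authors acknowledge several limitations of the dataset. OCR extraction is often imperfect, leading to noisy or incorrect text. To allow realistic testing despite this, the paper's abstract notes that the dataset can be used in three ways: answering questions from the raw, noisy content, from a cleaner, corrected version of the content, or from scanned images of the newspaper pages. Additionally, the automatically generated questions may not always capture the intent or nuance of the original newspaper content. Further human annotation and curation could help address these issues.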
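One common way to quantify OCR noise is the character error rate (CER): the edit distance between the OCR output and a clean reference, divided by the reference length. The sketch below is a minimal illustration of that idea, not a procedure described in the source; the function names are hypothetical and the sample strings in the usage note are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep when characters match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ocr_text, reference) / len(reference)
```

For instance, calling `character_error_rate("Tbe Pres1dent", "The President")` on this invented pair counts two substituted characters against a 13-character reference.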

It would also be valuable to see the dataset expanded beyond just American newspapers, to include a broader diversity of historical text sources and languages. Expanding the scope could make the dataset even more useful for developing robust, multilingual question answering capabilities.

Overall, ChroniclingAmericaQA is a laudable effort to push the boundaries of machine reading comprehension research. While it has some room for improvement, it represents an important new benchmark that can accelerate progress in this critical area of AI development.

Conclusion

The ChroniclingAmericaQA dataset provides a valuable new resource for researchers working on machine reading comprehension and question answering systems. By focusing on digitized historical newspaper content, it presents a more challenging and realistic test bed compared to existing QA datasets.

Advances in accurately understanding and answering questions about complex, real-world information sources like historical newspapers could have wide-ranging applications. These include improving our ability to explore and extract insights from large digital archives, as well as building more capable AI assistants that can engage in substantive dialogue.

While the dataset has some limitations, it represents an important step forward in the field of question answering research. Continued work to expand and refine ChroniclingAmericaQA, as well as leveraging it to drive progress in machine reading comprehension, could yield significant benefits for both the AI research community and society at large.



Related Papers

ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

Bhawna Piryani, Jamshid Mozafari, Adam Jatowt

Question answering (QA) and Machine Reading Comprehension (MRC) tasks have significantly advanced in recent years due to the rapid development of deep learning techniques and, more recently, large language models. At the same time, many benchmark datasets have become available for QA and MRC tasks. However, most existing large-scale benchmark datasets have been created predominantly using synchronous document collections like Wikipedia or the Web. Archival document collections, such as historical newspapers, contain valuable information from the past that is still not widely used to train large language models. To further contribute to advancing QA and MRC tasks and to overcome the limitation of previous datasets, we introduce ChroniclingAmericaQA, a large-scale temporal QA dataset with 487K question-answer pairs created based on the historical newspaper collection Chronicling America. Our dataset is constructed from a subset of the Chronicling America newspaper collection spanning 120 years. One of the significant challenges for utilizing digitized historical newspaper collections is the low quality of OCR text. Therefore, to enable realistic testing of QA models, our dataset can be used in three different ways: answering questions from raw and noisy content, answering questions from cleaner, corrected version of the content, as well as answering questions from scanned images of newspaper pages. This and the fact that ChroniclingAmericaQA spans the longest time period among available QA datasets make it quite a unique and useful resource.

5/13/2024
