From News to Summaries: Building a Hungarian Corpus for Extractive and Abstractive Summarization

arXiv:2404.03555

Published 4/15/2024 by Botond Barta, Dorina Lakatos, Attila Nagy, Milán Konor Nyist, Judit Ács

Abstract

Training summarization models requires substantial amounts of training data. However, for less resourceful languages like Hungarian, openly available models and datasets are notably scarce. To address this gap, our paper introduces HunSum-2, an open-source Hungarian corpus suitable for training abstractive and extractive summarization models. The dataset is assembled from segments of the Common Crawl corpus undergoing thorough cleaning, preprocessing and deduplication. In addition to abstractive summarization, we generate sentence-level labels for extractive summarization using sentence similarity. We train baseline models for both extractive and abstractive summarization using the collected dataset. To demonstrate the effectiveness of the trained models, we perform both quantitative and qualitative evaluation. Our dataset, models and code are publicly available, encouraging replication, further research, and real-world applications across various domains.

Overview

This paper describes the creation of HunSum-2, a new dataset for Hungarian language summarization that covers both extractive and abstractive summarization tasks.
The dataset pairs news articles drawn from the Common Crawl corpus with summaries and adds sentence-level labels for extractive training, allowing for the development and evaluation of summarization models.
The authors provide an analysis of the dataset characteristics and discuss its potential applications in advancing Hungarian language processing research.

Plain English Explanation

The researchers have created a new dataset that can be used to train and test computer systems for summarizing Hungarian news articles. Summarization is the task of taking a longer piece of text, like a news article, and generating a shorter, concise version that captures the key points.

The dataset contains a collection of Hungarian news articles along with human-written summaries for each article. Humans are very good at summarizing, so these human-written summaries can serve as a benchmark for evaluating how well computer systems can perform the summarization task.

Having a dataset like this is important because it allows researchers to develop and improve summarization models for the Hungarian language. Summarization is a valuable capability, as it can help people quickly understand the main ideas in a long document without having to read the entire thing. This new Hungarian dataset provides an important resource for advancing research in this area.

Technical Explanation

As the abstract states, the dataset was assembled from segments of the Common Crawl corpus, which were then cleaned, preprocessed, and deduplicated. The authors analyzed the dataset characteristics, finding that the summaries tend to be about 20-25% of the length of the original articles on average.
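A length ratio like the one quoted above can be computed in a few lines. The sketch below is illustrative only: it assumes whitespace tokenization and uses toy English pairs, and the function names are hypothetical rather than taken from the authors' code.

```python
def length_ratio(article: str, summary: str) -> float:
    """Return summary length divided by article length, in whitespace tokens."""
    article_tokens = article.split()
    if not article_tokens:
        raise ValueError("article must contain at least one token")
    return len(summary.split()) / len(article_tokens)


def mean_length_ratio(pairs) -> float:
    """Average the per-document ratio over (article, summary) pairs."""
    ratios = [length_ratio(article, summary) for article, summary in pairs]
    return sum(ratios) / len(ratios)


# Toy pairs for illustration only; these are not drawn from the corpus.
pairs = [
    ("one two three four five six seven eight", "one two"),  # ratio 0.25
    ("a b c d e f g h i j", "a b"),                          # ratio 0.20
]
print(mean_length_ratio(pairs))
```

Averaging per-document ratios is one reasonable choice; dividing total summary tokens by total article tokens would weight long articles more heavily and can give a different number.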

The dataset includes two types of summarization tasks: extractive summarization, where the model selects and extracts the most important sentences from the original text, and abstractive summarization, where the model generates novel summary text that may not directly match any of the original sentences.
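The abstract notes that sentence-level labels for extractive summarization are generated using sentence similarity. The sketch below shows one way such labeling could work. It assumes a simple Jaccard token-overlap similarity and a fixed `top_k`, both stand-ins chosen for illustration; the similarity measure and selection rule the authors actually use are not described here.

```python
import re


def tokens(text: str) -> set:
    """Lowercase word tokens; \\w is Unicode-aware, so accented letters are kept."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def label_sentences(sentences, summary, top_k=1):
    """Mark the top_k sentences most similar to the summary with 1, the rest 0."""
    summary_tokens = tokens(summary)
    scores = [jaccard(tokens(s), summary_tokens) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:top_k])
    return [1 if i in selected else 0 for i in range(len(sentences))]


# Toy English example for illustration only.
sentences = [
    "The city council approved the new budget on Monday.",
    "Local weather was sunny.",
    "Residents can comment online.",
]
summary = "Council approved the budget."
print(label_sentences(sentences, summary, top_k=1))
```

The resulting 0/1 labels can serve as targets for a sentence classifier, which is the usual way an abstractive reference summary is turned into supervision for an extractive model.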

The authors provide baseline results for both extractive and abstractive summarization using established models, demonstrating the usefulness of the dataset for evaluating and advancing Hungarian language summarization systems. They note that the dataset can also be used to study other language processing tasks, such as topic modeling and text classification.
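This summary does not name the metric behind the quantitative evaluation. ROUGE is a common choice for summarization, so the sketch below assumes it and computes a simplified ROUGE-1 F1 (unigram overlap) with whitespace tokenization and no stemming. It is meant to illustrate the idea, not to reproduce the authors' evaluation setup.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy English example for illustration only.
candidate = "the council approved the budget"
reference = "the council approved a new budget"
print(rouge1_f1(candidate, reference))
```

In this example four unigrams overlap, giving precision 4/5 and recall 4/6. Published results typically rely on an established ROUGE implementation with language-appropriate tokenization rather than a hand-rolled function like this one.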

Critical Analysis

The authors acknowledge several limitations of the dataset, including potential biases in the article selection and summary writing process. They also note that the dataset size, while substantial, may not be large enough to fully train and evaluate the most complex summarization models.

Additionally, the authors do not provide extensive details on the human evaluation process used to assess the quality of the summaries. More information on the criteria and methodology used would help readers better understand the reliability of the human-written summaries as a gold standard.

Future research could explore ways to expand the dataset, either by collecting more articles and summaries or by incorporating additional metadata or annotation layers. Investigating how the dataset performs on cross-lingual summarization tasks could also be an interesting avenue to pursue.

Conclusion

This paper presents a valuable new resource for Hungarian language processing research, specifically in the domain of text summarization. The dataset provides the necessary data to develop and evaluate both extractive and abstractive summarization models, which can have practical applications in areas like news consumption, document management, and language learning.

By making this dataset publicly available, the authors have taken an important step toward advancing the state of the art in Hungarian natural language processing. Continued research using this dataset could improve the ability of computers to understand and summarize Hungarian text, benefiting a wide range of users and applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
