NoticIA: A Clickbait Article Summarization Dataset in Spanish

2404.07611

Published 6/3/2024 by Iker García-Ferrero, Begoña Altuna
Abstract

We present NoticIA, a dataset consisting of 850 Spanish news articles featuring prominent clickbait headlines, each paired with high-quality, single-sentence generative summarizations written by humans. This task demands advanced text understanding and summarization abilities, challenging the models' capacity to infer and connect diverse pieces of information to meet the user's informational needs generated by the clickbait headline. We evaluate the Spanish text comprehension capabilities of a wide range of state-of-the-art large language models. Additionally, we use the dataset to train ClickbaitFighter, a task-specific model that achieves near-human performance in this task.

Overview

  • This paper introduces NoticIA, a dataset of Spanish clickbait article summaries.
  • Clickbait articles are designed to attract readers with sensational or misleading headlines, but the actual content may not match the hype.
  • The NoticIA dataset can be used to train machine learning models to automatically summarize clickbait articles in a more concise and accurate way.

Plain English Explanation

The NoticIA dataset provides a valuable resource for developing natural language processing (NLP) models that can handle clickbait articles in Spanish. Clickbait is a common online phenomenon where article titles are designed to be eye-catching and enticing, but the actual content may not live up to the hype. This can be frustrating for readers who want accurate and concise information.

The NoticIA dataset contains hundreds of Spanish clickbait articles paired with human-written summaries that capture the key points without the exaggeration. By training NLP models on this dataset, researchers can develop systems that can automatically summarize clickbait articles in a more truthful and compact way. This could be helpful for online readers, news aggregators, and social media platforms that want to provide a better user experience by cutting through the clickbait.

Technical Explanation

The NoticIA dataset was created by crawling popular Spanish news websites and identifying clickbait articles based on their titles. Each article was then manually summarized by human annotators in a single sentence that answers the question raised by the headline. The resulting dataset contains 850 article-summary pairs that can be used to train and evaluate machine learning models.
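To make the shape of the data concrete, the sketch below models one article-summary pair and builds a prompt asking a model to resolve the headline in one sentence. The field names (`headline`, `body`, `summary`), the prompt wording, and the example record are all illustrative assumptions; they are not the dataset's actual schema or the authors' prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickbaitExample:
    """One hypothetical article-summary pair (field names are assumptions)."""
    headline: str
    body: str
    summary: str


def build_prompt(example: ClickbaitExample) -> str:
    """Build an illustrative prompt; the wording is not taken from the paper."""
    return (
        "Answer the question raised by this clickbait headline in one sentence, "
        "using only the article text.\n\n"
        f"Headline: {example.headline}\n"
        f"Article: {example.body}\n"
        "Summary:"
    )


# Invented record, for illustration only (not drawn from the dataset).
example = ClickbaitExample(
    headline="You won't believe what this town banned",
    body="The town council voted on Monday to ban plastic bags in all shops.",
    summary="The town banned plastic bags in shops.",
)
prompt = build_prompt(example)
```

The human-written `summary` is kept out of the prompt so that it can serve as the reference when scoring a model's output.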

The authors evaluated a wide range of state-of-the-art large language models on the NoticIA dataset to measure their Spanish text comprehension. They also used the dataset to train ClickbaitFighter, a task-specific model that achieves near-human performance on this task. This demonstrates the value of having a specialized dataset to train NLP systems for this particular task.
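Comparing model output against a human-written summary requires some similarity score. This page does not say which metric the authors used, so the sketch below substitutes a simple unigram-overlap F1 (the idea behind ROUGE-1) as a stand-in. The function name `unigram_f1` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary.

    Tokenization is naive (lowercase + whitespace split); this is a stand-in
    metric, not necessarily the one used in the paper.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented strings, for illustration only.
score = unigram_f1(
    "the town banned plastic bags",
    "the town banned plastic bags in shops",
)
```

A score like this rewards word overlap only, so a fluent summary that paraphrases the reference can score lower than a weaker one that copies its wording.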

Critical Analysis

The NoticIA dataset provides a useful benchmark for evaluating Spanish clickbait summarization models, but it has some limitations. The dataset is relatively small, with only 850 articles, which may not be sufficient to train large, complex models. Additionally, the human-written summaries, while high-quality, may introduce some subjectivity and bias.

Future work could explore ways to expand the dataset, either by crawling more clickbait articles or using semi-supervised or unsupervised techniques to generate larger-scale training data. Incorporating other datasets, such as those for product description QA or multilingual fake news detection, could also help improve the robustness and generalization of clickbait summarization models.

Conclusion

The NoticIA dataset represents an important contribution to the field of NLP, providing a valuable resource for developing systems that can effectively summarize Spanish clickbait articles. By training models on this dataset, researchers can create tools that help online readers quickly identify the key information in sensationalized content, ultimately improving the quality and trustworthiness of the information they consume.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Generating clickbait spoilers with an ensemble of large language models

Mateusz Woźny, Mateusz Lango

Clickbait posts are a widespread problem in the webspace. The generation of spoilers, i.e. short texts that neutralize clickbait by providing information that satisfies the curiosity induced by it, is one of the proposed solutions to the problem. Current state-of-the-art methods are based on passage retrieval or question answering approaches and are limited to generating spoilers only in the form of a phrase or a passage. In this work, we propose an ensemble of fine-tuned large language models for clickbait spoiler generation. Our approach is not limited to phrase or passage spoilers, but is also able to generate multipart spoilers that refer to several non-consecutive parts of text. Experimental evaluation demonstrates that the proposed ensemble model outperforms the baselines in terms of BLEU, METEOR and BERTScore metrics.

5/28/2024

Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers

Alena Tsanda, Elena Bruches

The paper discusses the creation of a multimodal dataset of Russian-language scientific papers and testing of existing language models for the task of automatic text summarization. A feature of the dataset is its multimodal data, which includes texts, tables and figures. The paper presents the results of experiments with two language models: Gigachat from SBER and YandexGPT from Yandex. The dataset consists of 420 papers and is publicly available on https://github.com/iis-research-team/summarization-dataset.

5/14/2024

Research on Information Extraction of LCSTS Dataset Based on an Improved BERTSum-LSTM Model

Yiming Chen, Haobin Chen, Simin Liu, Yunyun Liu, Fanhao Zhou, Bing Wei

With the continuous advancement of artificial intelligence, natural language processing technology has become widely utilized in various fields. At the same time, there are many challenges in creating Chinese news summaries. First of all, the semantics of Chinese news is complex, and the amount of information is enormous. Extracting critical information from Chinese news presents a significant challenge. Second, the news summary should be concise and clear, focusing on the main content and avoiding redundancy. In addition, the particularity of the Chinese language, such as polysemy, word segmentation, etc., makes it challenging to generate Chinese news summaries. Based on the above, this paper studies the information extraction method of the LCSTS dataset based on an improved BERTSum-LSTM model. We improve the BERTSum-LSTM model to make it perform better in generating Chinese news summaries. The experimental results show that the proposed method has a good effect on creating news summaries, which is of great importance to the construction of news summaries.

6/27/2024

Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation

Ran Zhang, Jihed Ouni, Steffen Eger

While summarization has been extensively researched in natural language processing (NLP), cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding. This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation. We (1) build the first CLCTS corpus with 328 instances for hDe-En (extended version with 455 instances) and 289 for hEn-De (extended version with 501 instances), leveraging historical fiction texts and Wikipedia summaries in English and German; (2) examine the effectiveness of popular transformer end-to-end models with different intermediate finetuning tasks; (3) explore the potential of GPT-3.5 as a summarizer; (4) report evaluations from humans, GPT-4, and several recent automatic evaluation metrics. Our results indicate that intermediate task finetuned end-to-end models generate bad to moderate quality summaries while GPT-3.5, as a zero-shot summarizer, provides moderate to good quality outputs. GPT-3.5 also seems very adept at normalizing historical text. To assess data contamination in GPT-3.5, we design an adversarial attack scheme in which we find that GPT-3.5 performs slightly worse for unseen source documents compared to seen documents. Moreover, it sometimes hallucinates when the source sentences are inverted against its prior knowledge with a summarization accuracy of 0.67 for plot omission, 0.71 for entity swap, and 0.53 for plot negation. Overall, our regression results of model performances suggest that longer, older, and more complex source texts (all of which are more characteristic for historical language variants) are harder to summarize for all models, indicating the difficulty of the CLCTS task.

6/4/2024