Dataset of Quotation Attribution in German News Articles

Read original: arXiv:2404.16764 - Published 4/26/2024 by Fynn Petersen-Frey, Chris Biemann

Overview

  • This paper presents a new dataset for quotation attribution in German news articles. Quotation attribution, that is, determining who said what to whom, is a crucial task in analyzing communication data.
  • The dataset is based on WIKINEWS and provides curated, high-quality annotations across 1000 documents (250,000 tokens) with a fine-grained annotation schema.
  • The annotations specify who said what, how, in which context, to whom, and the type of quotation, enabling various downstream uses.
  • The paper also describes suitable evaluation metrics, applies two existing systems for quotation attribution, and discusses their results to evaluate the utility of the dataset.

Plain English Explanation

When analyzing online communication data, such as news articles, it is important to understand who is saying what to whom. This process, called quotation attribution, is crucial for making sense of the conversations and interactions happening in the data.

However, high-quality annotated datasets for quotation attribution in German have been lacking. To address this, the researchers created a new dataset based on WIKINEWS, a free-content news wiki, using its German-language articles. The dataset contains 1000 documents (around 250,000 tokens) with detailed annotations that specify:

  • Who said the quote
  • What they said
  • How they said it
  • The context in which the quote was said
  • Who the quote was directed at
  • The type of quote (e.g., direct, indirect)

This comprehensive annotation schema allows the dataset to be used for a variety of downstream tasks, such as understanding communication patterns, identifying influential voices, and detecting biases in news reporting.
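To make the schema concrete, here is a minimal sketch of how one annotated quotation could be represented in Python. It is not the dataset's actual file format: the class name, the field names, the token-offset convention, and the German example sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical convention: a span is (start, end) in token offsets, end exclusive.
Span = Tuple[int, int]


@dataclass(frozen=True)
class QuotationAnnotation:
    """One annotated quotation; field names are illustrative, not the paper's."""
    quote: Span                       # what was said
    quote_type: str                   # type of quotation, e.g. "direct" or "indirect"
    speaker: Optional[Span] = None    # who said it
    how: Optional[Span] = None        # how it was said
    context: Optional[Span] = None    # in which context
    addressee: Optional[Span] = None  # to whom


def span_text(tokens: List[str], span: Span) -> str:
    """Return the surface text covered by a token span."""
    start, end = span
    return " ".join(tokens[start:end])


# Invented example sentence (not taken from the dataset).
tokens = ["„", "Wir", "prüfen", "das", "“", ",", "sagte",
          "die", "Ministerin", "den", "Reportern", "."]

annotation = QuotationAnnotation(
    quote=(1, 4),
    quote_type="direct",
    speaker=(7, 9),
    how=(6, 7),
    addressee=(9, 11),
)

print(span_text(tokens, annotation.quote))    # the quoted words
print(span_text(tokens, annotation.speaker))  # the speaker mention
```

Keeping every role optional except the quote itself reflects that a quotation may appear without an explicit addressee or context in the text.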

The researchers also provide evaluation metrics and apply two existing quotation attribution systems to the dataset, helping to assess its utility and identify areas for further improvement.

Overall, this new dataset fills an important gap in German language research and provides a valuable resource for anyone interested in analyzing human communication in text.

Technical Explanation

The dataset is built from German-language WIKINEWS articles and comprises 1000 documents with about 250,000 tokens in total, or roughly 250 tokens per document on average. Each quotation is annotated in a fine-grained schema that records who said what, how, in which context, and to whom, along with the type of quotation (e.g., direct or indirect). According to the abstract, the authors specify this schema, describe how the dataset was created, and provide a quantitative analysis of it.

The paper also describes evaluation metrics suited to the quotation attribution task; measures such as precision, recall, and F1-score are typical for span-annotation tasks of this kind. The researchers apply two existing quotation attribution systems to the dataset and discuss the results, which helps to evaluate the utility of the dataset and to identify areas for further improvement.
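Precision, recall, and F1 over annotated spans can be sketched as follows. This is a generic exact-match scorer, not necessarily the metric defined in the paper: the function name, the `(start, end, label)` span format, and the example spans are assumptions for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical span format: (start, end, label) in token offsets, end exclusive.
LabeledSpan = Tuple[int, int, str]


def precision_recall_f1(
    gold: Iterable[LabeledSpan],
    predicted: Iterable[LabeledSpan],
) -> Tuple[float, float, float]:
    """Exact-match scoring: a predicted span counts as correct only if its
    boundaries and label are identical to a gold span."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: the system finds the quote and the speaker,
# but marks the wrong tokens as the addressee.
gold = {(1, 4, "quote"), (7, 9, "speaker"), (9, 11, "addressee")}
predicted = {(1, 4, "quote"), (7, 9, "speaker"), (6, 7, "addressee")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Exact matching is the strictest choice; a scorer that gives partial credit for overlapping spans would be more lenient toward small boundary errors.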

Critical Analysis

The researchers have addressed an important gap in German language research by creating a high-quality, annotated dataset for quotation attribution. The dataset's comprehensive annotation schema and large size (1000 documents, 250,000 tokens) make it a valuable resource for researchers and practitioners working on communication analysis tasks.

However, the abstract gives no details on inter-annotator agreement or on the specific challenges faced during the annotation process. Readers who want to judge the quality and reliability of the dataset should look for this information in the full paper.

Additionally, WIKINEWS as a source may carry its own biases and limitations, which could affect how well the dataset represents, or generalizes to, other types of German news articles.

Overall, the dataset presented in this paper is a significant contribution to the field of quotation attribution and communication analysis in the German language. The researchers have provided a well-designed and thoroughly annotated resource that can be used for a variety of downstream tasks.

Conclusion

This paper presents a new, freely available, Creative Commons-licensed dataset for quotation attribution in German news articles. The dataset provides high-quality, curated annotations across 1000 documents (250,000 tokens) with a fine-grained schema, specifying who said what, how, in which context, to whom, and the type of quotation.

By filling a gap in German-language resources, the dataset supports downstream uses such as understanding communication patterns, identifying influential voices, and detecting biases in news reporting. The accompanying evaluation metrics and the results of two existing quotation attribution systems give researchers and practitioners a baseline for judging its utility and for finding areas to improve.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Dataset of Quotation Attribution in German News Articles

Fynn Petersen-Frey, Chris Biemann

Extracting who says what to whom is a crucial part in analyzing human communication in today's abundance of data such as online news articles. Yet, the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, creative-commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. The dataset provides curated, high-quality annotations across 1000 documents (250,000 tokens) in a fine-grained annotation schema enabling various downstream uses for the dataset. The annotations not only specify who said what but also how, in which context, to whom and define the type of quotation. We specify our annotation schema, describe the creation of the dataset and provide a quantitative analysis. Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset and outline use cases of our dataset in downstream tasks.

4/26/2024

WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations

Haolin Deng, Chang Wang, Xin Li, Dezhang Yuan, Junlang Zhan, Tianhua Zhou, Jin Ma, Jun Gao, Ruifeng Xu

Enhancing the attribution in large language models (LLMs) is a crucial task. One feasible approach is to enable LLMs to cite external sources that support their generations. However, existing datasets and evaluation methods in this domain still exhibit notable limitations. In this work, we formulate the task of attributed query-focused summarization (AQFS) and present WebCiteS, a Chinese dataset featuring 7k human-annotated summaries with citations. WebCiteS derives from real-world user queries and web search results, offering a valuable resource for model training and evaluation. Prior works in attribution evaluation do not differentiate between groundedness errors and citation errors. They also fall short in automatically verifying sentences that draw partial support from multiple sources. We tackle these issues by developing detailed metrics and enabling the automatic evaluator to decompose the sentences into sub-claims for fine-grained verification. Our comprehensive evaluation of both open-source and proprietary models on WebCiteS highlights the challenge LLMs face in correctly citing sources, underscoring the necessity for further improvement. The dataset and code will be open-sourced to facilitate further research in this crucial field.

5/30/2024

AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection

Pia Pachinger, Janis Goldzycher, Anna Maria Planitzer, Wojciech Kusa, Allan Hanbury, Julia Neidhardt

Model interpretability in toxicity detection greatly profits from token-level annotations. However, currently such annotations are only available in English. We introduce a dataset annotated for offensive language detection sourced from a news forum, notable for its incorporation of the Austrian German dialect, comprising 4,562 user comments. In addition to binary offensiveness classification, we identify spans within each comment constituting vulgar language or representing targets of offensive statements. We evaluate fine-tuned language models as well as large language models in a zero- and few-shot fashion. The results indicate that while fine-tuned models excel in detecting linguistic peculiarities such as vulgar dialect, large language models demonstrate superior performance in detecting offensiveness in AustroTox. We publish the data and code.

6/13/2024

NewsQs: Multi-Source Question Generation for the Inquiring Mind

Alyssa Hwang, Kalpit Dixit, Miguel Ballesteros, Yassine Benajiba, Vittorio Castelli, Markus Dreyer, Mohit Bansal, Kathleen McKeown

We present NewsQs (news-cues), a dataset that provides question-answer pairs for multiple news documents. To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles from the News On the Web corpus. We show that fine-tuning a model with control codes produces questions that are judged acceptable more often than the same model without them as measured through human evaluation. We use a QNLI model with high correlation with human annotations to filter our data. We release our final dataset of high-quality questions, answers, and document clusters as a resource for future work in query-based multi-document summarization.

6/18/2024