STORYSUMM: Evaluating Faithfulness in Story Summarization

Read original: arXiv:2407.06501 - Published 7/10/2024 by Melanie Subbiah, Faisal Ladhak, Akankshya Mishra, Griffin Adams, Lydia B. Chilton, Kathleen McKeown

Overview

This paper introduces StorySumm, a dataset of LLM-generated summaries of short stories with localized faithfulness labels and error explanations.
The authors argue that in a challenging source domain like narrative, multiple human annotators can agree that a summary is faithful while missing errors that are obvious once pointed out.
StorySumm is a benchmark for faithfulness evaluation methods: none of the recent automatic metrics the authors test achieves more than 70% balanced accuracy on it.

Plain English Explanation

The paper proposes a new dataset called StorySumm to test how well we can check AI-written summaries of fictional stories. Stories are hard to summarize faithfully, and they are also hard to check: several human reviewers can agree that a summary is accurate and still overlook a mistake that seems obvious once someone points it out. StorySumm collects summaries of short stories written by large language models and marks where each summary departs from the original story, with an explanation of each error. Researchers can then run an evaluation method, human or automatic, against these labels to see whether it catches the errors. The automatic methods tested so far catch them unreliably, so there is considerable room for improvement.

Technical Explanation

Human evaluation has been the gold standard for checking faithfulness in abstractive summarization, but the authors argue that it is less reliable for narrative: multiple annotators can agree that a story summary is faithful while missing details that are clear errors once pointed out. To study this, they introduce the StorySumm dataset, which contains LLM-generated summaries of short stories annotated with localized faithfulness labels and error explanations.

The summaries in the dataset are produced by LLMs rather than written by people, and the labels mark where a summary is inconsistent with its source story and explain why. Using the dataset, the authors show that any single human annotation protocol is likely to miss inconsistencies, and they advocate combining a range of methods when establishing ground truth for a summarization dataset.
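To make the idea of localized annotation concrete, here is a minimal sketch of how one annotated summary might be represented in code. The field names (`story_id`, `sentences`, `faithful`, `explanation`), the sentence-level granularity, and the example text are all assumptions made for illustration, not the dataset's actual schema or contents. The sketch also shows one plausible aggregation rule: a summary counts as faithful only if every labeled sentence is.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SentenceLabel:
    """One summary sentence with a hypothetical localized faithfulness label."""
    text: str
    faithful: bool
    explanation: Optional[str] = None  # free-text error explanation, if any


@dataclass
class AnnotatedSummary:
    """Hypothetical record: a summary of one story, labeled per sentence."""
    story_id: str
    sentences: List[SentenceLabel] = field(default_factory=list)

    def is_faithful(self) -> bool:
        # Assumed rule: a single unfaithful sentence makes the summary unfaithful.
        return all(s.faithful for s in self.sentences)

    def errors(self) -> List[SentenceLabel]:
        # Return only the sentences flagged as inconsistent with the story.
        return [s for s in self.sentences if not s.faithful]


# Invented example, not taken from the dataset.
example = AnnotatedSummary(
    story_id="story-001",
    sentences=[
        SentenceLabel("Mara returns to her hometown after ten years.", True),
        SentenceLabel(
            "Her brother meets her at the station.",
            False,
            "In the story, a neighbor meets her, not her brother.",
        ),
    ],
)

for err in example.errors():
    print(f"{err.text} -> {err.explanation}")
```

Keeping the explanation next to each flagged sentence is what lets a reviewer confirm an error that might otherwise be overlooked.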

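StorySumm is intended as a benchmark for evaluation methods rather than for summarizers themselves: it tests whether a given method can detect challenging inconsistencies. The authors test recent automatic metrics and find that none achieves more than 70% balanced accuracy, which makes the task a demanding one for future work. Reliable faithfulness evaluation is a prerequisite for systems whose summaries are not just concise and fluent but faithful to the original story, a capability that matters for applications like creative writing assistance and summarizing novels or other long-form fiction.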
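The paper's abstract reports results in terms of balanced accuracy. As a reminder of what that metric measures, the sketch below computes it for binary faithful/unfaithful labels as the mean of the recall on each class. The function name and the toy labels are illustrative assumptions, not the paper's code or data.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Mean of per-class recall for binary labels (True = faithful)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    # Correctly predicted faithful and unfaithful summaries, respectively.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return 0.5 * (tp / pos + tn / neg)


# Toy data: 8 faithful and 2 unfaithful summaries.
gold = [True] * 8 + [False] * 2

# A degenerate "metric" that calls every summary faithful.
always_faithful = [True] * 10

plain_accuracy = sum(g == p for g, p in zip(gold, always_faithful)) / len(gold)
print(plain_accuracy)                             # 0.8
print(balanced_accuracy(gold, always_faithful))   # 0.5
```

On this toy data, a method that never flags an error still reaches 80% plain accuracy because most summaries are faithful, but only 50% balanced accuracy. Balanced accuracy therefore rewards methods that actually find the unfaithful summaries.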

Critical Analysis

The StorySumm dataset and evaluation framework represent an important step forward in assessing the faithfulness of story summarization models. By focusing on the unique challenges of narrative text, the authors address a limitation of existing summarization benchmarks, which have primarily evaluated performance on non-fiction domains.

However, several limitations of the StorySumm dataset and evaluation approach deserve attention. The dataset covers only short stories, and it is unclear how well the findings would generalize to longer, more complex fictional works. Potential biases in the LLM-generated summaries and in the human annotations could also affect the labels.

Additionally, although the authors test recent automatic metrics on StorySumm, it remains open how the benchmark relates to faithfulness evaluation frameworks developed for other domains, such as dialogue summarization. Further research is needed to understand how different faithfulness evaluation frameworks complement or compare to each other.

Conclusion

The StorySumm dataset and evaluation framework represent an important advancement in the field of text summarization, focusing on the unique challenges of summarizing fictional stories. By providing a way to assess the faithfulness of story summaries, this work can help drive the development of more capable and trustworthy AI systems for creative writing assistance, narrative generation, and summarization of long-form fiction. While the current implementation has some limitations, the core ideas presented in this paper lay the groundwork for further progress in this critical area of natural language processing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

STORYSUMM: Evaluating Faithfulness in Story Summarization

Melanie Subbiah, Faisal Ladhak, Akankshya Mishra, Griffin Adams, Lydia B. Chilton, Kathleen McKeown

Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, STORYSUMM, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.

7/10/2024

Increasing faithfulness in human-human dialog summarization with Spoken Language Understanding tasks

Eunice Akani, Benoit Favre, Frederic Bechet, Romain Gemignani

Dialogue summarization aims to provide a concise and coherent summary of conversations between multiple speakers. While recent advancements in language models have enhanced this process, summarizing dialogues accurately and faithfully remains challenging due to the need to understand speaker interactions and capture relevant information. Indeed, abstractive models used for dialog summarization may generate summaries that contain inconsistencies. We suggest using the semantic information proposed for performing Spoken Language Understanding (SLU) in human-machine dialogue systems for goal-oriented human-human dialogues to obtain a more semantically faithful summary regarding the task. This study introduces three key contributions: First, we propose an exploration of how incorporating task-related information can enhance the summarization process, leading to more semantically accurate summaries. Then, we introduce a new evaluation criterion based on task semantics. Finally, we propose a new dataset version with increased annotated data standardized for research on task-oriented dialogue summarization. The study evaluates these methods using the DECODA corpus, a collection of French spoken dialogues from a call center. Results show that integrating models with task-related information improves summary accuracy, even with varying word error rates.

9/17/2024

Leveraging Entailment Judgements in Cross-Lingual Summarisation

Huajian Zhang, Laura Perez-Beltrachini

Synthetically created Cross-Lingual Summarisation (CLS) datasets are prone to include document-summary pairs where the reference summary is unfaithful to the corresponding document as it contains content not supported by the document (i.e., hallucinated content). This low data quality misleads model learning and obscures evaluation results. Automatic ways to assess hallucinations and improve training have been proposed for monolingual summarisation, predominantly in English. For CLS, we propose to use off-the-shelf cross-lingual Natural Language Inference (X-NLI) to evaluate faithfulness of reference and model generated summaries. Then, we study training approaches that are aware of faithfulness issues in the training data and propose an approach that uses unlikelihood loss to teach a model about unfaithful summary sequences. Our results show that it is possible to train CLS models that yield more faithful summaries while maintaining comparable or better informativeness.

8/2/2024

Improving Faithfulness of Large Language Models in Summarization via Sliding Generation and Self-Consistency

Taiji Li, Zhi Li, Yin Zhang

Although large language models (LLMs) have demonstrated impressive performance in various tasks, they still suffer from the factual inconsistency problem known as hallucination. For instance, LLMs occasionally generate content that diverges from the source article, and prefer to extract information that appears at the beginning and end of the context, especially in long document summarization. Inspired by these findings, we propose to improve the faithfulness of LLMs in summarization by impelling them to process the entire article more fairly and faithfully. We present a novel summary generation strategy, namely SliSum, which exploits the ideas of sliding windows and self-consistency. Specifically, SliSum divides the source article into overlapping windows, and uses an LLM to generate local summaries for the content in the windows. Finally, SliSum aggregates all local summaries using clustering and a majority voting algorithm to produce a more faithful summary of the entire article. Extensive experiments demonstrate that SliSum significantly improves the faithfulness of diverse LLMs including LLaMA-2, Claude-2 and GPT-3.5 in both short and long text summarization, while maintaining their fluency and informativeness and without additional fine-tuning and resources. We further conduct qualitative and quantitative studies to investigate why SliSum works and how its hyperparameters affect performance.

8/1/2024