On Context Utilization in Summarization with Large Language Models

2310.10570

Published 6/17/2024 by Mathieu Ravaut, Aixin Sun, Nancy F. Chen, Shafiq Joty

Abstract

Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries. Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens. However, in question answering, language models exhibit uneven utilization of their input context. They tend to favor the initial and final segments, resulting in a U-shaped performance pattern concerning where the answer is located within the input. This bias raises concerns, particularly in summarization where crucial content may be dispersed throughout the source document(s). Besides, in summarization, mapping facts from the source to the summary is not trivial as salient content is usually re-phrased. In this paper, we conduct the first comprehensive study on context utilization and position bias in summarization. Our analysis encompasses 6 LLMs, 10 datasets, and 5 evaluation metrics. We introduce a new evaluation benchmark called MiddleSum, on which we benchmark two alternative inference methods to alleviate position bias: hierarchical summarization and incremental summarization. Our code and data can be found here: https://github.com/ntunlp/MiddleSum.

Overview

Large language models (LLMs) excel at generating fluent and relevant summaries, even for long input contexts over 100,000 tokens.
However, LLMs exhibit a bias when answering questions, favoring information at the beginning and end of the input while underutilizing the middle.
This positional bias is concerning for summarization, where crucial content may be dispersed throughout the source document(s).
Mapping facts from the source to the summary is also challenging, as salient information is often rephrased.

Plain English Explanation

Large language models are AI systems that can understand and generate human-like text. They're particularly good at creating summaries that capture the key points from long documents or passages.

However, these models also have some limitations. When answering questions, they tend to focus more on information at the beginning and end of the input, while overlooking important details in the middle. This "positional bias" is problematic for summarization, where crucial content may be scattered throughout the source material.

Additionally, mapping facts from the source document to the summary is not a straightforward task. The language models often rephrase the salient information rather than directly copying it.

Technical Explanation

This paper presents the first comprehensive study on how well large language models utilize the context provided in their input for the task of summarization. The researchers analyzed 6 different LLMs across 10 datasets, using 5 evaluation metrics.

They found a U-shaped performance pattern: the models made better use of information at the beginning and end of the input than of information in the middle sections. This positional bias is concerning, as important content may be dispersed throughout source documents.
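One way to make this kind of position analysis concrete is to align each summary sentence with the source sentence it overlaps most, then count how often the aligned sentence falls in the start, middle, or end of the source. The sketch below is only an illustration under stated assumptions: bigram overlap as the alignment signal, three equal-width bins, and the function names `aligned_positions` and `position_histogram` are all hypothetical choices, not the procedure used in the paper.

```python
import re


def bigrams(text):
    """Lowercased word bigrams of a sentence, as a set."""
    toks = re.findall(r"[a-z0-9]+", text.lower())
    return set(zip(toks, toks[1:]))


def aligned_positions(source_sentences, summary_sentences):
    """For each summary sentence, return the relative position (0.0 to 1.0)
    of the source sentence sharing the most bigrams with it, or None when
    no source sentence shares any bigram (assumed alignment heuristic)."""
    n = len(source_sentences)
    if n == 0:
        return [None] * len(summary_sentences)
    src = [bigrams(s) for s in source_sentences]
    positions = []
    for sentence in summary_sentences:
        sb = bigrams(sentence)
        overlaps = [len(sb & b) for b in src]
        best = max(range(n), key=lambda i: overlaps[i])
        if overlaps[best] == 0:
            positions.append(None)
        else:
            positions.append(best / (n - 1) if n > 1 else 0.0)
    return positions


def position_histogram(positions, num_bins=3):
    """Count aligned positions per equal-width bin (start, middle, end)."""
    counts = [0] * num_bins
    for p in positions:
        if p is None:
            continue
        counts[min(int(p * num_bins), num_bins - 1)] += 1
    return counts
```

With three bins, a histogram whose middle count is much lower than the outer two would be consistent with the U-shaped pattern described above, although lexical overlap undercounts content that the model has rephrased.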

To address this issue, the researchers introduced a new evaluation benchmark called MiddleSum. They then tested two alternative inference methods: hierarchical summarization and incremental summarization. These approaches aim to improve the models' ability to utilize the full context and reduce the position bias.
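The two inference methods can be sketched generically as follows. This is a minimal illustration, not the authors' implementation: the word-based chunking, the default chunk size, the prompt wording, and the `summarize` callable (a stand-in for any LLM call that maps a text to a summary) are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes a text, returns a summary.
Summarizer = Callable[[str], str]


def chunk_text(text: str, chunk_words: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


def hierarchical_summarize(text: str, summarize: Summarizer,
                           chunk_words: int = 500) -> str:
    """Summarize each chunk independently, then summarize the
    concatenation of the chunk summaries."""
    chunks = chunk_text(text, chunk_words)
    if len(chunks) <= 1:
        return summarize(text)
    partial = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partial))


def incremental_summarize(text: str, summarize: Summarizer,
                          chunk_words: int = 500) -> str:
    """Walk through the chunks in order, updating a running summary
    with each new chunk."""
    summary = ""
    for chunk in chunk_text(text, chunk_words):
        if summary:
            prompt = f"Current summary:\n{summary}\n\nNew text:\n{chunk}"
        else:
            prompt = chunk
        summary = summarize(prompt)
    return summary
```

In both variants every chunk is presented to the model as a short input of its own, so no part of the source sits deep in the middle of a very long context. The trade-off is more model calls, and errors in early summaries can carry forward into later steps.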

Critical Analysis

The paper provides a thorough and insightful analysis of the context utilization and position bias exhibited by large language models in summarization tasks. The researchers' introduction of the MiddleSum benchmark is a valuable contribution, as it allows for more targeted evaluation of model performance.

However, the paper does not explore the underlying reasons for the position bias. It would be interesting to investigate whether this bias is a fundamental limitation of the language models' architecture or a result of the training data and methods used.

Additionally, while the proposed inference methods show promise, the paper does not delve deeply into their strengths, weaknesses, and how they compare to other potential solutions. Further research is needed to fully understand the effectiveness and practical implications of these approaches.

Conclusion

This paper sheds light on an important limitation of large language models in the context of summarization tasks. The models' tendency to favor information at the beginning and end of the input, while underutilizing the middle, raises concerns about their ability to capture crucial content that may be dispersed throughout source documents.

The researchers' introduction of the MiddleSum benchmark and exploration of alternative inference methods are important steps towards addressing this position bias. Continued research in this area could lead to significant improvements in the summarization capabilities of large language models, with far-reaching implications for a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization

Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T. Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister

Large language models (LLMs), even when specifically trained to process long input contexts, struggle to capture relevant information located in the middle of their input. This phenomenon has been known as the lost-in-the-middle problem. In this work, we make three contributions. First, we set out to understand the factors that cause this phenomenon. In doing so, we establish a connection between lost-in-the-middle and LLMs' intrinsic attention bias: LLMs exhibit a U-shaped attention bias where the tokens at the beginning and at the end of their input receive higher attention, regardless of their relevance. Second, we mitigate this positional bias through a calibration mechanism, found-in-the-middle, that allows the model to attend to contexts faithfully according to their relevance, even when they are in the middle. Third, we show found-in-the-middle not only achieves better performance in locating relevant information within a long context, but also eventually leads to improved retrieval-augmented generation (RAG) performance across various tasks, outperforming existing methods by up to 15 percentage points. These findings open up future directions in understanding LLM attention bias and its potential consequences.

6/26/2024

cs.CL cs.AI cs.LG

Bias in News Summarization: Measures, Pitfalls and Corpora

Julius Steen, Katja Markert

Summarization is an important application of large language models (LLMs). Most previous evaluation of summarization models has focused on their content selection, faithfulness, grammaticality and coherence. However, it is well known that LLMs can reproduce and reinforce harmful social biases. This raises the question: Do biases affect model outputs in a constrained setting like summarization? To help answer this question, we first motivate and introduce a number of definitions for biased behaviours in summarization models, along with practical operationalizations. Since we find that biases inherent to input documents can confound bias analysis in summaries, we propose a method to generate input documents with carefully controlled demographic attributes. This allows us to study summarizer behavior in a controlled setting, while still working with realistic input documents. We measure gender bias in English summaries generated by both purpose-built summarization models and general purpose chat models as a case study. We find content selection in single document summarization to be largely unaffected by gender bias, while hallucinations exhibit evidence of bias. To demonstrate the generality of our approach, we additionally investigate racial bias, including intersectional settings.

6/7/2024

cs.CL

Make Your LLM Fully Utilize the Context

Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou

While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during the long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration and reasoning of information from two or more short segments. Through applying this information-intensive training on Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B for utilizing long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining a comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). Github Link: https://github.com/microsoft/FILM.

4/29/2024

cs.CL cs.AI

Supervised Knowledge Makes Large Language Models Better In-context Learners

Linyi Yang, Shuibai Zhang, Zhuohao Yu, Guangsheng Bao, Yidong Wang, Jindong Wang, Ruochen Xu, Wei Ye, Xing Xie, Weizhu Chen, Yue Zhang

Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The recent progress in large-scale generative models has further expanded their use in real-world language applications. However, the critical challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. While previous in-context learning research has focused on enhancing models to adhere to users' specific instructions and quality expectations, and to avoid undesired outputs, little to no work has explored the use of task-Specific fine-tuned Language Models (SLMs) to improve LLMs' in-context learning during the inference stage. Our primary contribution is the establishment of a simple yet effective framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks. Using our proposed plug-in method, enhanced versions of Llama 2 and ChatGPT surpass their original versions regarding generalizability and factuality. We offer a comprehensive suite of resources, including 16 curated datasets, prompts, model checkpoints, and LLM outputs across 9 distinct tasks. The code and data are released at: https://github.com/YangLinyi/Supervised-Knowledge-Makes-Large-Language-Models-Better-In-context-Learners. Our empirical analysis sheds light on the advantages of incorporating discriminative models into LLMs and highlights the potential of our methodology in fostering more reliable LLMs.

4/12/2024

cs.CL cs.AI