On Learning to Summarize with Large Language Models as References

Read original: arXiv:2305.14239 - Published 7/19/2024 by Yixin Liu, Kejian Shi, Katherine S He, Longtian Ye, Alexander R. Fabbri, Pengfei Liu, Dragomir Radev, Arman Cohan


Overview

Recent studies have found that human annotators prefer summaries generated by large language models (LLMs) over the original reference summaries in commonly used text summarization datasets.
This study explores using LLMs as a reference for training smaller text summarization models, investigating whether their performance can be substantially improved.
The researchers use LLMs as both oracle summary generators for standard supervised fine-tuning and oracle summary evaluators for efficient contrastive learning that leverages the LLMs' supervision signals.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text on a variety of topics. Recent studies have found that when these LLMs are used to generate summaries of text, people often prefer the LLM-generated summaries over the "official" reference summaries that were created by humans.

In this research, the scientists wanted to see if they could use these LLM-generated summaries as a guide to help train smaller text summarization models to perform better. They did this in two ways:

Supervised Fine-Tuning: They used the LLM-generated summaries as the "correct" answers to train their smaller summarization models.
Contrastive Learning: They used the LLMs as evaluators of candidate summaries, and trained their smaller models to prefer the candidates the LLMs judged to be better, helping them learn what makes a good summary.

The researchers found that the summarization models trained using the LLM-generated summaries as a reference were significantly better at generating high-quality summaries, as evaluated by both the LLMs themselves and by human judges.

Interestingly, the contrastive learning approach, which used the LLMs as evaluators, worked better than standard supervised fine-tuning in both low-resource and high-resource settings, that is, whether training data was limited or plentiful.

This research suggests that leveraging the capabilities of large language models could be a powerful way to improve the performance of smaller summarization models. However, the study also found that the LLMs are not perfect evaluators of summary quality, and there are still some nuanced differences between the LLM-generated summaries and the ones preferred by human judges.

Technical Explanation

The researchers conducted a series of experiments to investigate whether using LLMs as a reference can substantially improve the performance of smaller text summarization models.

First, they used the LLM-generated summaries as "ground truth" during the standard supervised fine-tuning of their summarization models. This allowed the models to learn from the high-quality summaries produced by the LLMs.
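As a rough illustration of the supervised fine-tuning objective, the sketch below computes the average negative log-likelihood of an LLM-written summary under the smaller model. The function name `sequence_nll` and the probability values are hypothetical; in a real setup the probabilities would come from the summarization model being trained, and the source does not spell out the exact loss used.

```python
import math


def sequence_nll(token_probs):
    """Average negative log-likelihood of the target tokens.

    token_probs: probabilities the small model assigns to each token of the
    LLM-written summary (hypothetical values; a real model would produce them).
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# A model that assigns high probability to the LLM summary's tokens gets a
# lower loss than one that assigns low probability to them.
confident = sequence_nll([0.9, 0.8, 0.95])
uncertain = sequence_nll([0.3, 0.2, 0.4])
print(confident < uncertain)
```

Minimizing this quantity over many article and summary pairs pushes the smaller model to reproduce the LLM's summaries token by token.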

Second, they used a contrastive learning approach in which the LLMs act as "oracle" evaluators of candidate summaries, and the LLMs' judgments serve as the supervision signal during training. This helped the models learn what characteristics make a good summary, as defined by the LLMs.
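One common way to turn an evaluator's judgments into a training signal is a pairwise margin ranking loss, sketched below. This is an assumption for illustration, not necessarily the loss used in the paper: the function name `ranking_contrastive_loss`, the `margin` parameter, and the example scores are all hypothetical. Here `model_scores` stands for the smaller model's own score for each candidate summary (for example, a length-normalized log-probability) and `llm_scores` for the quality score the LLM evaluator assigns.

```python
def ranking_contrastive_loss(model_scores, llm_scores, margin=0.1):
    """Pairwise margin ranking loss over candidate summaries.

    The loss is zero when the model scores every LLM-preferred candidate
    higher than every less-preferred one by at least margin * rank distance.
    """
    if len(model_scores) != len(llm_scores):
        raise ValueError("score lists must have the same length")
    # Sort candidate indices from best to worst according to the LLM evaluator.
    order = sorted(range(len(llm_scores)), key=lambda i: llm_scores[i], reverse=True)
    loss = 0.0
    for rank_i, i in enumerate(order):
        for rank_j in range(rank_i + 1, len(order)):
            j = order[rank_j]
            gap = margin * (rank_j - rank_i)
            # Penalize the model if the worse candidate j is not scored
            # sufficiently below the better candidate i.
            loss += max(0.0, model_scores[j] - model_scores[i] + gap)
    return loss


llm_scores = [3, 2, 1]  # hypothetical evaluator scores, higher is better
aligned = ranking_contrastive_loss([-0.1, -0.5, -0.9], llm_scores)
misaligned = ranking_contrastive_loss([-0.9, -0.5, -0.1], llm_scores)
print(aligned, misaligned > aligned)
```

The design point is that the model only needs the evaluator's ordering of candidates, not a reference summary, which is why this kind of objective can make use of an LLM acting purely as a judge.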

The researchers evaluated the summarization models trained with these LLM-based approaches on both LLM-based metrics and human evaluations. They found that the models trained using the LLM-as-reference setting achieved significant performance improvements compared to baseline models.

Notably, the contrastive learning approach outperformed standard supervised fine-tuning in both low-resource and high-resource settings, so its advantage held whether training data was scarce or plentiful.

The researchers' experimental results also enabled a meta-analysis of the LLMs' summary evaluation capacities, revealing that the LLMs are not well-aligned with human evaluators when it comes to assessing summary quality. Their expert human evaluation uncovered remaining nuanced performance gaps between the LLMs and the fine-tuned models that the LLM evaluators failed to capture.
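To make "alignment with human evaluators" concrete, the sketch below measures how often an LLM evaluator and human judges agree on which of two summaries is better. The source does not say which agreement measure the researchers used, so pairwise agreement is an assumption here, and the function name `pairwise_agreement` and the example scores are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(llm_scores, human_scores):
    """Fraction of summary pairs on which the LLM and humans prefer the same one.

    Pairs where either side rates the two summaries equally are skipped.
    """
    if len(llm_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(llm_scores)), 2):
        llm_diff = llm_scores[i] - llm_scores[j]
        human_diff = human_scores[i] - human_scores[j]
        if llm_diff == 0 or human_diff == 0:
            continue  # ties carry no preference
        total += 1
        if (llm_diff > 0) == (human_diff > 0):
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical scores for three summaries of the same article.
partial = pairwise_agreement([1, 2, 3], [1, 3, 2])
print(partial)
```

A value near 1.0 would mean the LLM evaluator ranks summaries much as humans do, while a value near 0.5 would mean its preferences are close to unrelated to human ones.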

Critical Analysis

The researchers acknowledge that while their findings suggest the potential of using LLMs as a reference for training smaller summarization models, there are still challenges and limitations to this approach.

The inconsistencies and biases in how LLMs evaluate summary quality compared to human judgments highlight the need for further research into understanding and improving the summarization capabilities of large language models.

Additionally, the researchers note that their experiments were conducted on news article datasets, and it remains to be seen how well the LLM-as-reference approach would generalize to other types of text or domains. Exploring the applicability of this technique in diverse settings would be an important area for future work.

Overall, this research provides valuable insights into the potential of leveraging powerful LLMs to enhance the performance of smaller text summarization models. However, it also highlights the need for continued investigation into the capabilities and limitations of LLMs as both summary generators and evaluators.

Conclusion

This study demonstrates that using large language models (LLMs) as a reference can significantly improve the performance of smaller text summarization models, both in terms of LLM-based metrics and human evaluations.

The researchers found that training summarization models using LLM-generated summaries as "ground truth" during supervised fine-tuning, as well as using LLMs to provide guidance and feedback through contrastive learning, led to substantial improvements in summary quality.

While the findings suggest the potential of leveraging LLMs to enhance text summarization, the research also reveals that LLMs are not perfect evaluators of summary quality and may miss nuanced differences that are important to human judges.

This work calls for further studies into both the capabilities and limitations of large language models in the context of text summarization, as well as the development of techniques that can effectively harness the power of these models to improve the performance of smaller, more specialized summarization systems.


