Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

Read original: arXiv:2311.09184 - Published 7/15/2024 by Yixin Liu, Alexander R. Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu, Dragomir Radev, Chien-Sheng Wu, Arman Cohan


Overview

Large language models (LLMs) can already perform well on standard summarization benchmarks, but their capabilities on more complex summarization tasks are less studied.
This research paper evaluates LLM performance on "instruction-controllable text summarization", where the model input includes both a source article and a natural language instruction describing the desired summary characteristics.
The researchers curate a dataset for this task, conduct human evaluations of LLM-based summarization systems, and benchmark automatic evaluation methods.

Plain English Explanation

Large language models (LLMs) are software systems trained on massive amounts of text data to understand and generate human-like language. These models have shown strong performance on standard summarization tasks, where the goal is to condense the key points of a given text into a shorter summary.

However, the researchers in this paper wanted to explore how well LLMs can handle more complex summarization scenarios. Specifically, they looked at "instruction-controllable text summarization," where the model not only has to summarize an input article, but also follow natural language instructions about the desired characteristics of the summary (e.g. "Provide a concise summary focusing on the main findings").

To test this, the researchers curated a new dataset and had human evaluators assess the quality of summaries generated by various LLM-based systems. They also benchmarked different automatic evaluation methods to see how well they aligned with the human judgments.

The key findings are that instruction-controlled summarization remains a challenging task for LLMs. The models still make factual errors and struggle to fully capture the nuances specified in the instructions. Additionally, the automatic evaluation methods tested did not strongly correlate with human assessments of summary quality.

This research highlights the limitations of current LLMs when it comes to more complex, context-dependent language tasks. While these models excel at many language-related applications, there is still work to be done to make them truly "controllable" and adaptable to specialized summarization requirements.

Technical Explanation

The researchers first curated an evaluation-only dataset for the task of instruction-controllable text summarization. This dataset contains source articles paired with natural language instructions specifying the desired characteristics of the summary (e.g. length, focus, tone).
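To make the task input concrete, here is a minimal Python sketch of how one evaluation example might be represented and turned into a model prompt. The `InstructionExample` class, its field names, and the prompt template are illustrative assumptions; they are not the schema or prompt used by the InstruSum benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionExample:
    """One evaluation example: a source article plus a summary requirement.

    Hypothetical structure for illustration; not the benchmark's actual schema.
    """
    article: str
    instruction: str


def build_prompt(example: InstructionExample) -> str:
    """Combine the article and the instruction into a single model input."""
    return (
        "Article:\n"
        f"{example.article.strip()}\n\n"
        "Summary requirement:\n"
        f"{example.instruction.strip()}\n\n"
        "Summary:"
    )


# Made-up example data, for illustration only.
example = InstructionExample(
    article="The city council voted on Tuesday to expand the bike lane network.",
    instruction="Summarize the article in one sentence, focusing on the decision.",
)
prompt = build_prompt(example)
print(prompt)
```

Keeping the article and the instruction as separate fields makes it easy to pair the same article with different requirements, which is the setting the task is meant to probe.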

They then evaluated the performance of five different LLM-based summarization systems on this dataset, using human annotators to judge the quality of the generated summaries. The human evaluations assessed factors like factual accuracy, relevance to instructions, and overall coherence.

Additionally, the researchers benchmarked 40 LLM-based automatic evaluation methods, built from 4 evaluation protocols and 11 LLMs, to see how well they aligned with the human judgments. This allowed them to assess how reliable automated evaluation is for instruction-controllable summaries.
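One common way to quantify how well an automatic evaluator aligns with human judgments is a rank correlation over the same set of candidate summaries. The sketch below computes Kendall's tau-a in plain Python. This summary does not say which agreement statistic the paper uses, so the choice of Kendall's tau, the function name `kendall_tau_a`, and the scores are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(human: Sequence[float], automatic: Sequence[float]) -> float:
    """Kendall's tau-a between two score lists for the same summaries.

    Counts concordant minus discordant pairs, divided by the total number
    of pairs. Tied pairs count as neither concordant nor discordant.
    """
    if len(human) != len(automatic):
        raise ValueError("score lists must have the same length")
    if len(human) < 2:
        raise ValueError("need at least two summaries to compare")

    concordant = discordant = 0
    for i, j in combinations(range(len(human)), 2):
        product = (human[i] - human[j]) * (automatic[i] - automatic[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    total_pairs = len(human) * (len(human) - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up scores for five candidate summaries (illustrative only).
human_scores = [5, 3, 4, 1, 2]
metric_scores = [4, 3, 5, 2, 1]
tau = kendall_tau_a(human_scores, metric_scores)
print(round(tau, 2))  # prints 0.6
```

A value near 1 means the automatic method ranks summaries almost the same way humans do, a value near 0 means the rankings are unrelated, and a negative value means they tend to disagree.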

The key findings were:

All of the LLM-based systems evaluated still made factual and other types of errors and struggled to fully capture the nuances specified in the instructions.
No LLM-based automatic evaluation method achieved strong alignment with human assessments of summary quality.
There were large performance gaps between different LLMs in both summary generation and evaluation capabilities.

These results suggest that instruction-controllable summarization remains a challenging task for current LLMs. The models have difficulty translating high-level natural language directives into coherent, faithful summaries. Additionally, automatically evaluating the quality of such summaries is an open problem that requires further research.

Critical Analysis

The researchers acknowledge several limitations and avenues for future work in this area:

The dataset they curated is evaluation-only, so it cannot be used for model training or development. Expanding the dataset to include training data could help LLMs learn to better handle instruction-controlled summarization.
The human evaluation process relied on a relatively small number of annotators (around 30). Expanding the pool of human raters could provide more robust and reliable assessments of summary quality.
The automatic evaluation methods tested did not achieve strong correlations with human judgments. Developing more sophisticated evaluation protocols tailored to instruction-controlled summarization could be an important area of future research.
The study only looked at a limited set of LLM-based systems. Expanding the scope to include a wider range of models and architectures could lead to additional insights.

Overall, the study shows that current LLMs remain limited on complex, context-dependent tasks such as instruction-controllable summarization, both when generating summaries and when evaluating them. Closing these gaps is needed before the models can be called truly "controllable".

Conclusion

This paper provides a detailed evaluation of large language models' capabilities on the task of instruction-controllable text summarization. The key findings are that LLMs still struggle with this complex task, making factual errors and failing to fully capture the nuances specified in the natural language instructions.

The researchers also found that existing automatic evaluation methods do not reliably align with human judgments of summary quality, suggesting the need for more sophisticated evaluation protocols tailored to this domain.

While LLMs have shown remarkable progress in many language-related applications, this study underscores the limitations of these models when it comes to handling specialized, context-dependent requirements. Continued research in this area could lead to significant advancements in developing truly controllable and adaptable language AI systems.



Related Papers


Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

Yixin Liu, Alexander R. Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu, Dragomir Radev, Chien-Sheng Wu, Arman Cohan

While large language models (LLMs) can already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement for desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct human evaluations of five LLM-based systems to assess their instruction-following capabilities in controllable summarization. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) no LLM-based evaluation methods can achieve a strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation capabilities. We make our collected benchmark InstruSum publicly available to facilitate future research in this direction.

7/15/2024


Can Large Language Model Summarizers Adapt to Diverse Scientific Communication Goals?

Marcio Fonseca, Shay B. Cohen

In this work, we investigate the controllability of large language models (LLMs) on scientific summarization tasks. We identify key stylistic and content coverage factors that characterize different types of summaries such as paper reviews, abstracts, and lay summaries. By controlling stylistic features, we find that non-fine-tuned LLMs outperform humans in the MuP review generation task, both in terms of similarity to reference summaries and human preferences. Also, we show that we can improve the controllability of LLMs with keyword-based classifier-free guidance (CFG) while achieving lexical overlap comparable to strong fine-tuned baselines on arXiv and PubMed. However, our results also indicate that LLMs cannot consistently generate long summaries with more than 8 sentences. Furthermore, these models exhibit limited capacity to produce highly abstractive lay summaries. Although LLMs demonstrate strong generic summarization competency, sophisticated content control without costly fine-tuning remains an open problem for domain-specific applications.

6/28/2024


On Learning to Summarize with Large Language Models as References

Yixin Liu, Kejian Shi, Katherine S He, Longtian Ye, Alexander R. Fabbri, Pengfei Liu, Dragomir Radev, Arman Cohan

Recent studies have found that summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. Therefore, we study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved. To this end, we use LLMs as both oracle summary generators for standard supervised fine-tuning and oracle summary evaluators for efficient contrastive learning that leverages the LLMs' supervision signals. We conduct comprehensive experiments with source news articles and find that (1) summarization models trained under the LLM-as-reference setting achieve significant performance improvement in both LLM and human evaluations; (2) contrastive learning outperforms standard supervised fine-tuning under both low and high resource settings. Our experimental results also enable a meta-analysis of LLMs' summary evaluation capacities under a challenging setting, showing that LLMs are not well-aligned with human evaluators. Particularly, our expert human evaluation reveals remaining nuanced performance gaps between LLMs and our fine-tuned models, which LLMs fail to capture. Thus, we call for further studies into both the potential and challenges of using LLMs in summarization model development.

7/19/2024

A Comparative Study of Quality Evaluation Methods for Text Summarization

Huyen Nguyen, Haihua Chen, Lavanya Pobbathi, Junhua Ding

Evaluating text summarization has been a challenging task in natural language processing (NLP). Automatic metrics which heavily rely on reference summaries are not suitable in many situations, while human evaluation is time-consuming and labor-intensive. To bridge this gap, this paper proposes a novel method based on large language models (LLMs) for evaluating text summarization. We also conduct a comparative study on eight automatic metrics, human evaluation, and our proposed LLM-based method. Seven different types of state-of-the-art (SOTA) summarization models were evaluated. We perform extensive experiments and analysis on datasets with patent documents. Our results show that LLM evaluation aligns closely with human evaluation, while widely-used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency. Based on the empirical comparison, we propose an LLM-powered framework for automatically evaluating and improving text summarization, which is beneficial and could attract wide attention among the community.

7/2/2024