Fine-Tuned 'Small' LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification

2406.08660

Published 6/14/2024 by Martin Juan José Bucher, Marco Martini

Abstract

Generative AI offers a simple, prompt-based alternative to fine-tuning smaller BERT-style LLMs for text classification tasks. This promises to eliminate the need for manually labeled training data and task-specific model training. However, it remains an open question whether tools like ChatGPT can deliver on this promise. In this paper, we show that smaller, fine-tuned LLMs (still) consistently and significantly outperform larger, zero-shot prompted models in text classification. We compare three major generative AI models (ChatGPT with GPT-3.5/GPT-4 and Claude Opus) with several fine-tuned LLMs across a diverse set of classification tasks (sentiment, approval/disapproval, emotions, party positions) and text categories (news, tweets, speeches). We find that fine-tuning with application-specific training data achieves superior performance in all cases. To make this approach more accessible to a broader audience, we provide an easy-to-use toolkit alongside this paper. Our toolkit, accompanied by non-technical step-by-step guidance, enables users to select and fine-tune BERT-like LLMs for any classification task with minimal technical and computational effort.

Overview

This paper compares the performance of fine-tuned "small" language models (LLMs) against zero-shot generative AI models in text classification tasks.
The researchers find that fine-tuned "small" LLMs significantly outperform zero-shot generative AI models across a range of text classification benchmarks.
The paper provides insights into the trade-offs between model size, fine-tuning, and zero-shot performance in natural language processing (NLP) applications.

Plain English Explanation

The paper explores the performance of two different types of language models in text classification tasks. On one side, there are "small" language models that have been fine-tuned, or retrained, on specific datasets to become experts at particular tasks. On the other side, there are zero-shot generative AI models that can perform a wide variety of tasks without any additional training.

The researchers find that the fine-tuned "small" language models significantly outperform the zero-shot generative AI models across a range of text classification benchmarks. This suggests that for certain NLP applications, like sentiment analysis or emotion detection, the specialized knowledge gained through fine-tuning is more valuable than the broad capabilities of zero-shot models.

The results highlight the trade-offs between model size, fine-tuning, and zero-shot performance. While zero-shot generative AI models can be applied to a wide range of tasks without additional training, fine-tuned "small" models may be better suited for specific applications where high accuracy is crucial. This information can help researchers and practitioners make more informed decisions when selecting language models for their NLP projects.

Technical Explanation

The paper evaluates fine-tuned "small" language models (LLMs) against zero-shot generative AI models on a variety of text classification tasks. The researchers fine-tuned several smaller BERT-style LLMs on application-specific training data and compared their performance to that of three zero-shot prompted generative AI models: ChatGPT with GPT-3.5, ChatGPT with GPT-4, and Claude Opus.
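
Zero-shot prompting, unlike fine-tuning, amounts to wrapping each text in an instruction and mapping the model's free-text reply back to a label. The following is a minimal Python sketch of that wrapping and parsing step. The label set, the prompt wording, and the function names `build_prompt` and `parse_label` are hypothetical and not taken from the paper, and the actual call to a generative model is deliberately left out.

```python
# Hypothetical label set and prompt wording; neither comes from the paper.
LABELS = ["positive", "negative", "neutral"]


def build_prompt(text, labels=LABELS):
    """Wrap a text in a zero-shot classification instruction."""
    options = ", ".join(labels)
    return (
        f"Classify the sentiment of the following text as one of: {options}.\n"
        f"Text: {text}\n"
        "Answer with a single label."
    )


def parse_label(reply, labels=LABELS):
    """Map a free-text model reply to a known label, or None if unclear."""
    cleaned = reply.strip().lower()
    matches = [label for label in labels if label in cleaned]
    # Accept the reply only when exactly one known label is mentioned.
    return matches[0] if len(matches) == 1 else None
```

Returning `None` for ambiguous replies is one possible design choice: it keeps unparseable answers visible so they can be counted as errors rather than silently assigned to a class.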

The experiments covered a diverse set of classification tasks (sentiment, approval/disapproval, emotions, and party positions) and text categories (news, tweets, and speeches). The researchers measured the classification accuracy of each model and found that the fine-tuned "small" LLMs significantly outperformed the zero-shot generative AI models across all the tasks.
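
Comparing the two model types comes down to scoring each model's predictions against the same gold labels. The sketch below shows one way to do that in plain Python. Accuracy is the metric named in this summary; macro-averaged F1 is added here as an assumption, since it is a common companion metric when classes are imbalanced. The toy labels are made up for illustration and are not results from the paper.

```python
def accuracy(gold, pred):
    """Share of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up toy labels for illustration only (not data from the paper).
gold = ["pos", "neg", "pos", "neg"]
fine_tuned_pred = ["pos", "neg", "pos", "pos"]
zero_shot_pred = ["pos", "pos", "neg", "pos"]

for name, pred in [("fine-tuned", fine_tuned_pred), ("zero-shot", zero_shot_pred)]:
    print(name, accuracy(gold, pred), macro_f1(gold, pred))
```

Scoring both models on an identical held-out test set, as done here, is what makes the comparison fair; the fine-tuned model must not have seen those examples during training.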

The paper provides insights into the trade-offs between model size, fine-tuning, and zero-shot performance. While zero-shot generative AI models can be applied to a wide range of tasks without additional training, the specialized knowledge gained through fine-tuning allows "small" LLMs to achieve higher accuracy on specific applications. This suggests that for certain NLP tasks, the performance boost from fine-tuning may outweigh the versatility of zero-shot models.

Critical Analysis

The paper provides a comprehensive evaluation of fine-tuned "small" LLMs and zero-shot generative AI models in text classification tasks. The researchers acknowledge that the performance of these models may vary depending on the specific task and dataset, and they encourage further research in this area.

One potential limitation of the study is the use of a limited set of text classification benchmarks. While the selected tasks are representative of common NLP applications, it would be valuable to expand the evaluation to a wider range of benchmarks, including more specialized or domain-specific tasks, to better understand the relative strengths and weaknesses of the two model types.

Additionally, the paper does not delve into the computational and resource requirements of fine-tuning "small" LLMs versus deploying zero-shot generative AI models. This information could be crucial for practitioners who need to balance model performance with practical considerations, such as inference latency, energy consumption, or deployment constraints.

Overall, the paper presents a well-designed study and offers valuable insights into the trade-offs between fine-tuning and zero-shot performance in text classification. The findings can help researchers and practitioners make more informed decisions when selecting language models for their NLP projects, depending on their specific requirements and constraints.

Conclusion

This paper provides a comprehensive comparison of fine-tuned "small" language models and zero-shot generative AI models in text classification tasks. The key finding is that the fine-tuned "small" LLMs significantly outperform the zero-shot generative AI models across a range of benchmarks, highlighting the value of specialized knowledge gained through fine-tuning for certain NLP applications.

The results offer important insights into the trade-offs between model size, fine-tuning, and zero-shot performance, which can guide researchers and practitioners in selecting the most appropriate language models for their NLP projects. While zero-shot generative AI models offer versatility, fine-tuned "small" LLMs may be better suited for tasks where high accuracy is crucial, such as sentiment analysis or classifying party positions.

As the field of NLP continues to evolve, understanding the capabilities and limitations of different language models will be essential for developing effective and efficient solutions to a wide range of real-world problems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Open-Source LLMs for Text Annotation: A Practical Guide for Model Setting and Fine-Tuning

Meysam Alizadeh, Mael Kubli, Zeynab Samei, Shirin Dehghani, Mohammadmasiha Zahedivafa, Juan Diego Bermeo, Maria Korobeynikova, Fabrizio Gilardi

This paper studies the performance of open-source Large Language Models (LLMs) in text classification tasks typical for political science research. By examining tasks like stance, topic, and relevance classification, we aim to guide scholars in making informed decisions about their use of LLMs for text analysis. Specifically, we conduct an assessment of both zero-shot and fine-tuned LLMs across a range of text annotation tasks using news articles and tweets datasets. Our analysis shows that fine-tuning improves the performance of open-source LLMs, allowing them to match or even surpass zero-shot GPT-3.5 and GPT-4, though still lagging behind fine-tuned GPT-3.5. We further establish that fine-tuning is preferable to few-shot training with a relatively modest quantity of annotated text. Our findings show that fine-tuned open-source LLMs can be effectively deployed in a broad spectrum of text annotation applications. We provide a Python notebook facilitating the application of LLMs in text annotation for other researchers.

5/30/2024

cs.CL

Zero-Shot Spam Email Classification Using Pre-trained Large Language Models

Sergio Rojas-Galeano

This paper investigates the application of pre-trained large language models (LLMs) for spam email classification using zero-shot prompting. We evaluate the performance of both open-source (Flan-T5) and proprietary LLMs (ChatGPT, GPT-4) on the well-known SpamAssassin dataset. Two classification approaches are explored: (1) truncated raw content from email subject and body, and (2) classification based on summaries generated by ChatGPT. Our empirical analysis, leveraging the entire dataset for evaluation without further training, reveals promising results. Flan-T5 achieves a 90% F1-score on the truncated content approach, while GPT-4 reaches a 95% F1-score using summaries. While these initial findings on a single dataset suggest the potential for classification pipelines of LLM-based subtasks (e.g., summarisation and classification), further validation on diverse datasets is necessary. The high operational costs of proprietary models, coupled with the general inference costs of LLMs, could significantly hinder real-world deployment for spam filtering.

5/28/2024

cs.CL cs.AI

A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks

Yanis Labrak, Mickael Rouvier, Richard Dufour

We evaluate four state-of-the-art instruction-tuned large language models (LLMs) -- ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca -- on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English, such as named-entity recognition (NER), question-answering (QA), relation extraction (RE), etc. Our overall results demonstrate that the evaluated LLMs begin to approach performance of state-of-the-art models in zero- and few-shot scenarios for most tasks, and particularly well for the QA task, even though they have never seen examples from these tasks before. However, we observed that the classification and RE tasks perform below what can be achieved with a specifically trained model for the medical field, such as PubMedBERT. Finally, we noted that no LLM outperforms all the others on all the studied tasks, with some models being better suited for certain tasks than others.

6/11/2024

cs.CL cs.AI cs.LG

Generation-driven Contrastive Self-training for Zero-shot Text Classification with Instruction-following LLM

Ruohong Zhang, Yau-Shian Wang, Yiming Yang

The remarkable performance of large language models (LLMs) in zero-shot language understanding has garnered significant attention. However, employing LLMs for large-scale inference or domain-specific fine-tuning requires immense computational resources due to their substantial model size. To overcome these limitations, we introduce a novel method, namely GenCo, which leverages the strong generative power of LLMs to assist in training a smaller and more adaptable language model. In our method, an LLM plays an important role in the self-training loop of a smaller model in two important ways. Firstly, the LLM is used to augment each input instance with a variety of possible continuations, enriching its semantic context for better understanding. Secondly, it helps crafting additional high-quality training pairs, by rewriting input texts conditioned on predicted labels. This ensures the generated texts are highly relevant to the predicted labels, alleviating the prediction error during pseudo-labeling, while reducing the dependency on large volumes of unlabeled text. In our experiments, GenCo outperforms previous state-of-the-art methods when only limited (<5% of original) in-domain text data is available. Notably, our approach surpasses the performance of Alpaca-7B with human prompts, highlighting the potential of leveraging LLM for self-training.

4/16/2024

cs.CL cs.AI