Are LLMs Effective Backbones for Fine-tuning? An Experimental Investigation of Supervised LLMs on Chinese Short Text Matching

2403.19930

Published 4/1/2024 by Shulin Liu, Chengcheng Xu, Hao Liu, Tinghao Yu, Tao Yang

Are LLMs Effective Backbones for Fine-tuning? An Experimental Investigation of Supervised LLMs on Chinese Short Text Matching

Abstract

The recent success of Large Language Models (LLMs) has garnered significant attention in both academia and industry. Prior research on LLMs has primarily focused on enhancing or leveraging their generalization capabilities in zero- and few-shot settings. However, there has been limited investigation into effectively fine-tuning LLMs for a specific natural language understanding task in supervised settings. In this study, we conduct an experimental analysis by fine-tuning LLMs for the task of Chinese short text matching. We explore various factors that influence performance when fine-tuning LLMs, including task modeling methods, prompt formats, and output formats.

Create account to get full access

Introduction

This paper examines the performance of Large Language Models (LLMs) on natural language understanding (NLU) tasks, specifically for the task of Chinese short text matching. The authors find that while LLMs like GPT-3 and LLaMA excel at natural language generation, they struggle to match the performance of fine-tuned smaller models like BERT on NLU tasks.

The paper explores three key factors that affect the performance of LLMs on the Chinese short text matching task:

Task modeling methods: The authors compare modeling the task as a generative task (generating the target label) versus a discriminative classification task (using the LLM outputs for binary classification).
Prompt formats: The paper examines the impact of different prompt styles, including a concise direct prompt versus a more detailed instructional prompt.
Output formats: The authors investigate the effect of incorporating Chain of Thought (CoT) into the output format during training.

Experiments on two Chinese short text matching datasets show that the fine-tuned CLLM-7B (a Chinese-enhanced LLaMA model) outperforms both fine-tuned BERT and few-shot GPT-4. The results also indicate that the generative modeling approach outperforms the discriminative approach, especially with limited training data. Finally, the experiments reveal that incorporating CoT into the output format is beneficial for the supervised matching task.

Backgrounds

The text provides an overview of the Chinese short text matching task and the datasets used in this study. Chinese short text matching, which aims to identify semantic equivalence between sentence pairs, is a fundamental natural language processing task with applications in question answering and dialogue systems.

The study examines two widely-used Chinese short text matching datasets: LCQMC and BQ. LCQMC is a large-scale, open-domain question matching corpus with 260,068 Chinese search query pairs, while BQ is a domain-specific, large-scale corpus for bank question matching with 120,000 Chinese sentence pairs. Both datasets have binary labels indicating whether the sentence pairs share the same meaning.

The evaluation metric used is accuracy, which measures the percentage of correctly predicted examples.

Experiments and Results

This section outlines the experimental configurations and presents the results. The authors examine the influence of three factors discussed in Section 1 through the following experiments:

Generative vs. Discriminative Models:
- The authors model the matching task as both a generative and a discriminative task. The generative model (CLLM-7B-GEN) merges the provided pair of sentences and prompts the model to generate the target label. The discriminative model (CLLM-7B-CLS) concatenates the given pair of texts as input, extracts vector representations, and performs binary classification.
- The results show that when the number of training samples is less than 20,000, the generative model (CLLM-GEN) significantly outperforms the discriminative models, including BERT and CLLM-CLS, on both the LCQMC and BQ datasets. This suggests that the generative approach aligns better with the pre-training procedure of large language models.
Concise vs. Complex Prompts:
- The authors compare the performance of models trained with concise and complex prompts. The results show that the models achieve comparable performance, indicating that supervised large language models are not sensitive to prompt design.
Effects of Chain-of-Thought (CoT):
- The authors investigate the effectiveness of CoT in supervised settings. They obtain CoT for the training samples by using GPT-4 to determine whether a given pair of texts is equivalent, and utilize the generated explanations as CoT.
- The results demonstrate that adding CoT to the training samples improves the performance of the generative model (CLLM-GEN-CoT) on both the LCQMC and BQ datasets, particularly for the more challenging BQ dataset.

Conclusions

The paper conducts an experimental study on fine-tuning large language models (LLMs) for the task of Chinese short text matching. It investigates various factors that affect performance during the fine-tuning process, including task modeling methods, prompt formats, and the chain of thought approach.

Experiments were systematically carried out on two widely used datasets. The results show that the fine-tuned CLLM-7B model outperforms both fine-tuned BERT and few-shot GPT-4, indicating that LLMs can serve as effective backbones in supervised scenarios. Additionally, the generative paradigm is found to be superior to the discriminative approach, especially when training data is limited.

The paper also reveals that supervised LLMs are less sensitive to prompts compared to zero- and few-shot LLMs. Furthermore, the chain of thought approach is beneficial for supervised text matching. While the experiments focus on text matching, the observations may be applicable to other natural language understanding tasks, such as text classification.

tations

The study has two key limitations. First, prompt engineering is critical for zero-shot and few-shot large language models (LLMs). The researchers assessed the few-shot performance of GPT-4, as shown in Figure 2. Despite their careful design of the few-shot prompts, the prompt designs remain subjective and may not represent the optimal choices. Second, the study focuses solely on the text matching task. Additional experiments may be needed to determine if the conclusions apply to other natural language understanding (NLU) tasks, such as text classification.

Appendix A Appendix

Figure 6: An illustration of 2-shot prompt for LCQMC.

Figure 7: An illustration of 2-shot prompt for BQ.

Figure 8: Examples of complex and simple prompts in Section3.2

Figure 9: The Chinese version of texts in Figure 4

Figure 10: An example of training sample with CoT.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Measuring Taiwanese Mandarin Language Understanding

Po-Heng Chen, Sijia Cheng, Wei-Lin Chen, Yen-Ting Lin, Yun-Nung Chen

The evaluation of large language models (LLMs) has drawn substantial attention in the field recently. This work focuses on evaluating LLMs in a Chinese context, specifically, for Traditional Chinese which has been largely underrepresented in existing benchmarks. We present TMLU, a holistic evaluation suit tailored for assessing the advanced knowledge and reasoning capability in LLMs, under the context of Taiwanese Mandarin. TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels. In addition, we curate chain-of-thought-like few-shot explanations for each subject to facilitate the evaluation of complex reasoning skills. To establish a comprehensive baseline, we conduct extensive experiments and analysis on 24 advanced LLMs. The results suggest that Chinese open-weight models demonstrate inferior performance comparing to multilingual proprietary ones, and open-weight models tailored for Taiwanese Mandarin lag behind the Simplified-Chinese counterparts. The findings indicate great headrooms for improvement, and emphasize the goal of TMLU to foster the development of localized Taiwanese-Mandarin LLMs. We release the benchmark and evaluation scripts for the community to promote future research.

4/1/2024

cs.CL

Supervised Knowledge Makes Large Language Models Better In-context Learners

Linyi Yang, Shuibai Zhang, Zhuohao Yu, Guangsheng Bao, Yidong Wang, Jindong Wang, Ruochen Xu, Wei Ye, Xing Xie, Weizhu Chen, Yue Zhang

Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The recent progress in large-scale generative models has further expanded their use in real-world language applications. However, the critical challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. While previous in-context learning research has focused on enhancing models to adhere to users' specific instructions and quality expectations, and to avoid undesired outputs, little to no work has explored the use of task-Specific fine-tuned Language Models (SLMs) to improve LLMs' in-context learning during the inference stage. Our primary contribution is the establishment of a simple yet effective framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks. Using our proposed plug-in method, enhanced versions of Llama 2 and ChatGPT surpass their original versions regarding generalizability and factuality. We offer a comprehensive suite of resources, including 16 curated datasets, prompts, model checkpoints, and LLM outputs across 9 distinct tasks. The code and data are released at: https://github.com/YangLinyi/Supervised-Knowledge-Makes-Large-Language-Models-Better-In-context-Learners. Our empirical analysis sheds light on the advantages of incorporating discriminative models into LLMs and highlights the potential of our methodology in fostering more reliable LLMs.

4/12/2024

cs.CL cs.AI

I Learn Better If You Speak My Language: Understanding the Superior Performance of Fine-Tuning Large Language Models with LLM-Generated Responses

Xuan Ren, Biao Wu, Lingqiao Liu

This paper explores an intriguing observation: fine-tuning a large language model (LLM) with responses generated by a LLM often yields better results than using responses generated by humans. We conduct an in-depth investigation to understand why this occurs. Contrary to the common belief that these instances is simply due to the more detailed nature of LLM-generated content, our study identifies another contributing factor: an LLM is inherently more familiar with LLM generated responses. This familiarity is evidenced by lower perplexity before fine-tuning. We design a series of experiments to understand the impact of the familiarity and our conclusion reveals that this familiarity significantly impacts learning performance. Training with LLM-generated responses not only enhances performance but also helps maintain the model's capabilities in other tasks after fine-tuning on a specific task.

6/4/2024

cs.CL cs.AI

A Novel Paradigm Boosting Translation Capabilities of Large Language Models

Jiaxin Guo, Hao Yang, Zongyao Li, Daimeng Wei, Hengchao Shang, Xiaoyu Chen

This paper presents a study on strategies to enhance the translation capabilities of large language models (LLMs) in the context of machine translation (MT) tasks. The paper proposes a novel paradigm consisting of three stages: Secondary Pre-training using Extensive Monolingual Data, Continual Pre-training with Interlinear Text Format Documents, and Leveraging Source-Language Consistent Instruction for Supervised Fine-Tuning. Previous research on LLMs focused on various strategies for supervised fine-tuning (SFT), but their effectiveness has been limited. While traditional machine translation approaches rely on vast amounts of parallel bilingual data, our paradigm highlights the importance of using smaller sets of high-quality bilingual data. We argue that the focus should be on augmenting LLMs' cross-lingual alignment abilities during pre-training rather than solely relying on extensive bilingual data during SFT. Experimental results conducted using the Llama2 model, particularly on Chinese-Llama2 after monolingual augmentation, demonstrate the improved translation capabilities of LLMs. A significant contribution of our approach lies in Stage2: Continual Pre-training with Interlinear Text Format Documents, which requires less than 1B training data, making our method highly efficient. Additionally, in Stage3, we observed that setting instructions consistent with the source language benefits the supervised fine-tuning process. Experimental results demonstrate that our approach surpasses previous work and achieves superior performance compared to models such as NLLB-54B and GPT3.5-text-davinci-003, despite having a significantly smaller parameter count of only 7B or 13B. This achievement establishes our method as a pioneering strategy in the field of machine translation.

4/16/2024

cs.CL