MiniLLM: Knowledge Distillation of Large Language Models

2306.08543

Published 4/11/2024 by Yuxian Gu, Li Dong, Furu Wei, Minlie Huang

MiniLLM: Knowledge Distillation of Large Language Models

Abstract

Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable for different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/minillm.

Create account to get full access

Overview

This paper explores techniques for distilling knowledge from large language models (LLMs) into smaller, more efficient models.
The authors investigate several knowledge distillation (KD) approaches, including Rethinking Kullback-Leibler Divergence for Knowledge Distillation, GOLD: Generalized Out-of-Distribution Learning via Out-Distribution Knowledge Distillation, and Improve Knowledge Distillation via Label Revision and Data Augmentation.
The paper also provides a comprehensive review of knowledge distillation in computer vision and introduces a new technique called MMIDR: Teaching a Large Language Model to Interpret.

Plain English Explanation

Large language models (LLMs) like GPT-3 are incredibly powerful, but they can also be very large and computationally expensive to run. Knowledge distillation is a technique that aims to take the knowledge from a large, powerful model and transfer it to a smaller, more efficient model. This allows the benefits of the large model to be enjoyed without the high computational cost.

The authors of this paper explore several different approaches to knowledge distillation for LLMs. They look at ways to improve the standard knowledge distillation process, such as by using different ways of measuring the "distance" between the large and small models. They also investigate techniques for training the small model to handle data that is outside of the normal training distribution, which can be a challenge.

Additionally, the paper provides a broad overview of knowledge distillation research in the field of computer vision, highlighting how the techniques developed there might be applicable to language models as well. Finally, the authors introduce a new method called MMIDR, which aims to not just transfer knowledge, but to also teach the small model how to interpret and explain its own outputs.

The key idea behind all of these approaches is to find ways to capture the rich knowledge and capabilities of large, powerful language models and distill them into smaller, more efficient models that can be more easily deployed and used in real-world applications.

Technical Explanation

The paper begins by surveying several recent advances in knowledge distillation (KD) for large language models (LLMs). First, it examines the Rethinking Kullback-Leibler Divergence for Knowledge Distillation approach, which proposes an alternative divergence measure to the standard Kullback-Leibler (KL) divergence used in KD.

Next, the authors explore the GOLD: Generalized Out-of-Distribution Learning via Out-Distribution Knowledge Distillation method, which aims to improve the small model's performance on data that is outside the normal training distribution.

The paper also covers the Improve Knowledge Distillation via Label Revision and Data Augmentation technique, which uses label revision and data augmentation to enhance the effectiveness of KD.

In addition to these specific KD approaches, the paper provides a comprehensive review of knowledge distillation in computer vision, highlighting how the lessons learned in that domain may be applicable to language models as well.

Finally, the authors introduce a new technique called MMIDR: Teaching a Large Language Model to Interpret, which aims to not only transfer knowledge from the large model to the small model, but also to teach the small model how to interpret and explain its own outputs.

Critical Analysis

The paper provides a thorough exploration of various knowledge distillation techniques for large language models, highlighting both the strengths and limitations of each approach. One potential limitation is that the paper does not delve deeply into the specific tradeoffs and performance implications of the different KD methods, which may be of interest to readers.

Additionally, while the authors introduce the MMIDR technique as a novel contribution, the paper does not provide a comprehensive evaluation of its performance compared to other KD approaches. It would be helpful to see more detailed results and analysis to understand the potential benefits and drawbacks of this new method.

Furthermore, the paper's discussion of the broader applications of knowledge distillation in computer vision could be expanded to more explicitly draw connections to the language model domain and discuss the unique challenges and considerations involved.

Overall, the paper presents a valuable survey of the state-of-the-art in knowledge distillation for large language models, but could be strengthened by a more in-depth critical analysis and a clearer articulation of the key trade-offs and open research questions in this area.

Conclusion

This paper explores various techniques for distilling the knowledge of large, powerful language models into smaller, more efficient models. The authors investigate several specific KD approaches, including novel divergence measures, out-of-distribution learning, and label revision and data augmentation. They also provide a comprehensive review of KD research in computer vision and introduce a new technique called MMIDR that aims to not only transfer knowledge but also teach the small model how to interpret its own outputs.

The key takeaway is that knowledge distillation offers a promising approach for making the capabilities of large language models more accessible and deployable in real-world applications, by transferring their rich knowledge to smaller, more efficient models. However, the paper also highlights the need for continued research to further improve the performance and robustness of these KD techniques, especially when dealing with the unique challenges of language modeling.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Knowledge Distillation of LLM for Automatic Scoring of Science Education Assessments

Ehsan Latif, Luyang Fang, Ping Ma, Xiaoming Zhai

This study proposes a method for knowledge distillation (KD) of fine-tuned Large Language Models (LLMs) into smaller, more efficient, and accurate neural networks. We specifically target the challenge of deploying these models on resource-constrained devices. Our methodology involves training the smaller student model (Neural Network) using the prediction probabilities (as soft labels) of the LLM, which serves as a teacher model. This is achieved through a specialized loss function tailored to learn from the LLM's output probabilities, ensuring that the student model closely mimics the teacher's performance. To validate the performance of the KD approach, we utilized a large dataset, 7T, containing 6,684 student-written responses to science questions and three mathematical reasoning datasets with student-written responses graded by human experts. We compared accuracy with state-of-the-art (SOTA) distilled models, TinyBERT, and artificial neural network (ANN) models. Results have shown that the KD approach has 3% and 2% higher scoring accuracy than ANN and TinyBERT, respectively, and comparable accuracy to the teacher model. Furthermore, the student model size is 0.03M, 4,000 times smaller in parameters and x10 faster in inferencing than the teacher model and TinyBERT, respectively. The significance of this research lies in its potential to make advanced AI technologies accessible in typical educational settings, particularly for automatic scoring.

6/13/2024

cs.CL cs.AI

💬

Revisiting Knowledge Distillation for Autoregressive Language Models

Qihuang Zhong, Liang Ding, Li Shen, Juhua Liu, Bo Du, Dacheng Tao

Knowledge distillation (KD) is a common approach to compress a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, in the context of autoregressive language models (LMs), we empirically find that larger teacher LMs might dramatically result in a poorer student. In response to this problem, we conduct a series of analyses and reveal that different tokens have different teaching modes, neglecting which will lead to performance degradation. Motivated by this, we propose a simple yet effective adaptive teaching approach (ATKD) to improve the KD. The core of ATKD is to reduce rote learning and make teaching more diverse and flexible. Extensive experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains (up to +3.04% average score) across all model types and sizes. More encouragingly, ATKD can improve the student model generalization effectively.

6/18/2024

cs.CL

Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models

Taiqiang Wu, Chaofan Tao, Jiahao Wang, Zhe Zhao, Ngai Wong

Kullback-Leiber divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable over the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither mode-seeking nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective and both converge after a sufficient number of epochs. However, due to practical constraints, LLMs are seldom trained for such an extensive number of epochs. Meanwhile, we further find that RKL focuses on the tail part of the distributions, while FKL focuses on the head part at the beginning epochs. Consequently, we propose a simple yet effective Adaptive Kullback-Leiber (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.

6/18/2024

cs.CL cs.AI

Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation

Yuhang Zhou, Jing Zhu, Paiheng Xu, Xiaoyu Liu, Xiyao Wang, Danai Koutra, Wei Ai, Furong Huang

Large language models (LLMs) have significantly advanced various natural language processing tasks, but deploying them remains computationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models. Particularly, sequence-level KD, which distills rationale-based reasoning processes instead of merely final outcomes, shows great potential in enhancing students' reasoning capabilities. However, current methods struggle with sequence level KD under long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting representative head domain examples and synthesizing tail domain examples, BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.

6/21/2024

cs.CL cs.AI