GOLD: Generalized Knowledge Distillation via Out-of-Distribution-Guided Language Data Generation

2403.19754

Published 4/1/2024 by Mohsen Gholami, Mohammad Akbari, Cindy Hu, Vaden Masrani, Z. Jane Wang, Yong Zhang

Abstract

Knowledge distillation from LLMs is essential for the efficient deployment of language models. Prior works have proposed data generation using LLMs for preparing distilled models. We argue that generating data with LLMs is prone to sampling mainly from the center of the original content distribution. This limitation hinders the distilled model from learning the true underlying data distribution and causes it to forget the tails of the distribution (samples with lower probability). To this end, we propose GOLD, a task-agnostic data generation and knowledge distillation framework, which employs an iterative out-of-distribution-guided feedback mechanism for the LLM. As a result, the generated data improves the generalizability of distilled models. An energy-based OOD evaluation approach is also introduced to deal with noisy generated data. Our extensive experiments on 10 different classification and sequence-to-sequence tasks in NLP show that GOLD outperforms prior art and the LLM by an average of 5% and 14%, respectively. We also show that the proposed method is applicable to less explored and novel tasks. The code is available.

Overview

This paper presents GOLD (Generalized Knowledge Distillation via Out-of-Distribution-Guided Language Data Generation), a task-agnostic framework that generates training data with an LLM and distills that LLM's knowledge into smaller language models.
GOLD aims to enhance the capability of smaller, more efficient language models by distilling knowledge from larger, more powerful models in a generalized way.
The key innovation is the use of out-of-distribution (OOD) data to guide the generation of diverse training data, which helps the smaller model learn a more comprehensive and robust understanding of language.

Plain English Explanation

The paper introduces a method called GOLD that helps make smaller language models just as capable as larger, more complex ones. Language models are AI systems that can understand and generate human language. Larger models tend to perform better, but they require more computing power and resources, which can limit their practical use.

GOLD works by taking the knowledge from a larger, more powerful language model and transferring it to a smaller model. This process is called "knowledge distillation." The novel aspect of GOLD is that it uses data that is outside the normal training distribution to guide the generation of new training data for the smaller model. This out-of-distribution data helps the smaller model learn a more complete and versatile understanding of language, beyond just the typical examples it would see during standard training.

By using this approach, the smaller model can gain capabilities that approach those of the larger, more complex model, but with much lower computational requirements. This makes the technology more practical and accessible for real-world applications.

Technical Explanation

The key components of the GOLD method are:

Knowledge Distillation: GOLD uses a pre-trained large language model as the "teacher" and trains a smaller "student" model to mimic the teacher's behavior. This allows the student model to benefit from the knowledge captured by the more powerful teacher.
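Out-of-Distribution (OOD) Data Generation: Data generated by an LLM tends to come from the center of the original distribution, so the student rarely sees low-probability samples from the tails. GOLD counters this with an iterative feedback mechanism: OOD samples, which fall outside the typical distribution of the original training data, are fed back to the teacher LLM to steer it toward more diverse generations. An energy-based OOD evaluation step is used to deal with noisy generated data.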
OOD-Guided Data Augmentation: The OOD-generated samples are combined with the original in-distribution training data to create an augmented dataset. This helps the student model learn a more comprehensive understanding of language.
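One feedback round of the components above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the energy score uses the standard logsumexp form from the energy-based OOD literature, and the helper names (energy_score, select_feedback), the temperature, the noise cutoff, and the toy logits are all hypothetical. The reading assumed here is that generated samples are scored from a classifier's logits, the highest-energy samples are kept as OOD feedback for the LLM, and samples above a cutoff are dropped as likely noise.

```python
import math


def energy_score(logits, temperature=1.0):
    """Energy of a sample from classifier logits: -T * logsumexp(logits / T).

    Higher energy means lower overall confidence, which is commonly read
    as "more out-of-distribution".
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * lse


def select_feedback(samples, logits_per_sample, k=2, noise_cutoff=None):
    """Return the k highest-energy samples, skipping any above noise_cutoff.

    Hypothetical helper: the cutoff stands in for the paper's handling of
    noisy generations, whose exact rule is not given in this summary.
    """
    scored = [(energy_score(l), s) for s, l in zip(samples, logits_per_sample)]
    if noise_cutoff is not None:
        scored = [(e, s) for e, s in scored if e <= noise_cutoff]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


# Toy logits for four generated samples (made-up numbers).
samples = ["typical A", "typical B", "unusual C", "garbled D"]
logits = [[6.0, 0.5], [5.0, 1.0], [1.2, 1.0], [0.1, 0.0]]

# "garbled D" has the highest energy but exceeds the cutoff, so it is dropped.
feedback = select_feedback(samples, logits, k=2, noise_cutoff=-1.0)
print(feedback)
```

In a full loop, the selected samples would be placed in the LLM's prompt for the next generation round, and the student would be retrained on the accumulated data. Both of those steps are omitted here because the summary does not specify them.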

The authors conduct experiments on 10 classification and sequence-to-sequence NLP tasks and report that GOLD outperforms prior data-generation and distillation methods by an average of 5%, and the teacher LLM itself by an average of 14%. They attribute the gain to the OOD-guided data generation, which improves how well the distilled student generalizes.

Critical Analysis

The paper makes a solid technical contribution by introducing a novel approach to knowledge distillation for language models. OOD-guided data generation directly targets a weakness of earlier LLM-based data generation, which tends to sample from the center of the distribution and leaves the student with little exposure to the tails.

However, the paper does not delve deeply into potential limitations or caveats of the GOLD method. For instance, the reliance on the teacher model to guide the OOD data generation could introduce biases or make the approach overly dependent on the quality of the teacher. Additionally, the computational overhead of the OOD data generation process is not thoroughly analyzed.

Further research could explore ways to make the OOD data generation more efficient and robust, as well as investigate how GOLD performs on a wider range of language tasks and domains beyond the standard benchmarks presented in the paper.

Conclusion

The GOLD method presents an innovative approach to knowledge distillation for language models, leveraging out-of-distribution data generation to help smaller models match the capabilities of larger, more complex models. This advance has the potential to make powerful language AI more accessible and practical for real-world applications. While the paper demonstrates the effectiveness of GOLD, further research is needed to fully understand its limitations and explore ways to optimize the technique.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MiniLLM: Knowledge Distillation of Large Language Models

Yuxian Gu, Li Dong, Furu Wei, Minlie Huang

Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable for different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/minillm.

4/11/2024

cs.CL cs.AI
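The forward versus reverse KL distinction in the MiniLLM abstract above can be made concrete with a toy calculation. This is an illustrative sketch, not code from the MiniLLM repository; the distributions and the helper name kl are made up for the example. The student here puts substantial mass on an outcome the teacher considers very unlikely, and the two objectives penalize that differently.

```python
import math


def kl(a, b):
    """KL(a || b) for discrete distributions given as equal-length lists."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


teacher = [0.70, 0.29, 0.01]  # third outcome is a low-probability region
student = [0.50, 0.25, 0.25]  # student overestimates that region

forward_kl = kl(teacher, student)  # standard KD objective: KL(teacher || student)
reverse_kl = kl(student, teacher)  # reverse objective: KL(student || teacher)

print(f"forward KL = {forward_kl:.3f}, reverse KL = {reverse_kl:.3f}")
```

In this toy case the reverse KL comes out larger than the forward KL, so a student trained on the reverse objective is pushed harder away from overestimating the teacher's low-probability regions.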

Lightweight Model Pre-training via Language Guided Knowledge Distillation

Mingsheng Li, Lin Zhang, Mingzhen Zhu, Zilong Huang, Gang Yu, Jiayuan Fan, Tao Chen

This paper studies the problem of pre-training for small models, which is essential for many mobile devices. Current state-of-the-art methods on this problem transfer the representational knowledge of a large network (as a Teacher) into a smaller model (as a Student) using self-supervised distillation, improving the performance of the small model on downstream tasks. However, existing approaches are insufficient in extracting the crucial knowledge that is useful for discerning categories in downstream tasks during the distillation process. In this paper, for the first time, we introduce language guidance to the distillation process and propose a new method named Language-Guided Distillation (LGD) system, which uses category names of the target downstream task to help refine the knowledge transferred between the teacher and student. To this end, we utilize a pre-trained text encoder to extract semantic embeddings from language and construct a textual semantic space called Textual Semantics Bank (TSB). Furthermore, we design a Language-Guided Knowledge Aggregation (LGKA) module to construct the visual semantic space, also named Visual Semantics Bank (VSB). The task-related knowledge is transferred by driving a student encoder to mimic the similarity score distribution inferred by a teacher over TSB and VSB. Compared with other small models obtained by either ImageNet pre-training or self-supervised distillation, experiment results show that the distilled lightweight model using the proposed LGD method presents state-of-the-art performance and is validated on various downstream tasks, including classification, detection, and segmentation. We have made the code available at https://github.com/mZhenz/LGD.

6/18/2024

cs.CV

Intermediate Distillation: Data-Efficient Distillation from Black-Box LLMs for Information Retrieval

Zizhong Li, Haopeng Zhang, Jiawei Zhang

Recent research has explored distilling knowledge from large language models (LLMs) to optimize retriever models, especially within the retrieval-augmented generation (RAG) framework. However, most existing training methods rely on extracting supervision signals from LLMs' weights or their output probabilities, which is not only resource-intensive but also incompatible with black-box LLMs. In this paper, we introduce Intermediate Distillation, a data-efficient knowledge distillation training scheme that treats LLMs as black boxes and distills their knowledge via an innovative LLM-ranker-retriever pipeline, solely using LLMs' ranking generation as the supervision signal. Extensive experiments demonstrate that our proposed method can significantly improve the performance of retriever models with only 1,000 training instances. Moreover, our distilled retriever model significantly boosts performance in question-answering tasks within the RAG framework, demonstrating the potential of LLMs to economically and effectively train smaller models.

6/19/2024

cs.IR

Distilling Robustness into Natural Language Inference Models with Domain-Targeted Augmentation

Joe Stacey, Marek Rei

Knowledge distillation optimises a smaller student model to behave similarly to a larger teacher model, retaining some of the performance benefits. While this method can improve results on in-distribution examples, it does not necessarily generalise to out-of-distribution (OOD) settings. We investigate two complementary methods for improving the robustness of the resulting student models on OOD domains. The first approach augments the distillation with generated unlabelled examples that match the target distribution. The second method upsamples data points among the training set that are similar to the target distribution. When applied on the task of natural language inference (NLI), our experiments on MNLI show that distillation with these modifications outperforms previous robustness solutions. We also find that these methods improve performance on OOD domains even beyond the target domain.

5/31/2024

cs.CL cs.LG