Knowledge Distillation of LLM for Automatic Scoring of Science Education Assessments

arXiv:2312.15842

Published 6/13/2024 by Ehsan Latif, Luyang Fang, Ping Ma, Xiaoming Zhai

Abstract

This study proposes a method for knowledge distillation (KD) of fine-tuned Large Language Models (LLMs) into smaller, more efficient, and accurate neural networks. We specifically target the challenge of deploying these models on resource-constrained devices. Our methodology involves training the smaller student model (Neural Network) using the prediction probabilities (as soft labels) of the LLM, which serves as a teacher model. This is achieved through a specialized loss function tailored to learn from the LLM's output probabilities, ensuring that the student model closely mimics the teacher's performance. To validate the performance of the KD approach, we utilized a large dataset, 7T, containing 6,684 student-written responses to science questions and three mathematical reasoning datasets with student-written responses graded by human experts. We compared accuracy with state-of-the-art (SOTA) distilled models, TinyBERT, and artificial neural network (ANN) models. Results have shown that the KD approach has 3% and 2% higher scoring accuracy than ANN and TinyBERT, respectively, and comparable accuracy to the teacher model. Furthermore, the student model size is 0.03M, 4,000 times smaller in parameters and x10 faster in inferencing than the teacher model and TinyBERT, respectively. The significance of this research lies in its potential to make advanced AI technologies accessible in typical educational settings, particularly for automatic scoring.

Overview

This paper explores the use of large language models (LLMs) for automatic scoring in educational applications.
The researchers investigate knowledge distillation techniques to create smaller, more efficient models that can match the performance of the larger LLMs.
The goal is to develop educational technology that can provide accurate, scalable, and cost-effective assessment solutions.

Plain English Explanation

Large language models (LLMs) like BERT have shown impressive capabilities in various natural language processing tasks. The researchers in this paper explore how these powerful LLMs can be used to automatically score student responses in educational settings.

However, LLMs can be computationally intensive and expensive to deploy at scale. To address this, the researchers use a technique called knowledge distillation to create smaller, more efficient models that can match the performance of the larger LLMs.

The idea behind knowledge distillation is to train a smaller "student" model to mimic the behavior of a larger "teacher" model. By learning from the teacher's knowledge, the student model can achieve similar levels of accuracy while being much more lightweight and cost-effective to deploy.
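To make the teacher-student idea concrete, here is a minimal sketch of a soft-label distillation loss in plain Python. It is an illustration under stated assumptions, not the paper's loss function: the abstract only says the student learns from the teacher's prediction probabilities through a specialized loss. The blend weight `alpha`, the function names, and the three-class example are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


def distillation_loss(student_logits, teacher_probs, true_label, alpha=0.5):
    """Blend of soft-label loss (teacher) and hard-label loss (human score).

    alpha is an assumed hyperparameter: 1.0 means learn only from the
    teacher's probabilities, 0.0 means learn only from the human label.
    """
    student_probs = softmax(student_logits)
    hard_target = [1.0 if i == true_label else 0.0 for i in range(len(student_probs))]
    soft_loss = cross_entropy(teacher_probs, student_probs)
    hard_loss = cross_entropy(hard_target, student_probs)
    return alpha * soft_loss + (1 - alpha) * hard_loss


# Hypothetical teacher output for one response over three score categories.
teacher_probs = [0.7, 0.2, 0.1]

# A student that agrees with the teacher gets a lower loss than one that does not.
agreeing = distillation_loss([2.0, 0.5, 0.0], teacher_probs, true_label=0)
disagreeing = distillation_loss([0.0, 0.5, 2.0], teacher_probs, true_label=0)
print(agreeing < disagreeing)
```

Minimizing this quantity over many responses pushes the student's predicted distribution toward the teacher's, which is the sense in which the student "mimics" the teacher.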

The researchers explore different approaches to knowledge distillation, including label revision and leveraging visual features, to create efficient models for automatic scoring of student responses.

By developing these more accessible and scalable LLM-based assessment solutions, the researchers aim to empower educators and technology providers with tools that can deliver accurate, personalized, and comprehensive feedback to students.

Technical Explanation

The paper begins by highlighting the potential of large language models (LLMs) for automatic scoring in educational applications. LLMs, such as BERT, have shown impressive performance on natural language understanding tasks, making them well-suited for evaluating and providing feedback on student responses.

However, the authors acknowledge the computational and cost challenges of deploying these large models at scale. To address this, they investigate knowledge distillation techniques to create smaller, more efficient models that can match the performance of the larger LLMs.

The researchers experiment with different knowledge distillation approaches, including label revision and leveraging visual features. In the label revision method, the student model is trained on revised labels obtained by applying the teacher model to the training data. In the abstract's terms, the teacher's prediction probabilities serve as soft labels: rather than a single score per response, the student sees how much probability the teacher assigns to each score category.
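Labelling training data with the teacher's outputs can be sketched as follows. This is a hedged illustration only: `toy_teacher_logits` is a hypothetical stand-in for a fine-tuned LLM, the three score categories are invented, and the `temperature` parameter is a common distillation convention that the source does not mention.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def toy_teacher_logits(response):
    """Hypothetical teacher: scores over (incorrect, partial, correct).

    A real teacher would be a fine-tuned LLM; this keyword rule only
    exists so that the example runs on its own.
    """
    mentions_key_idea = "energy" in response.lower()
    return [0.5, 1.0, 3.0] if mentions_key_idea else [3.0, 1.0, 0.5]


def build_soft_labeled_dataset(responses, teacher_logits_fn, temperature=2.0):
    """Pair each response with the teacher's probability vector (its soft label)."""
    return [(text, softmax(teacher_logits_fn(text), temperature)) for text in responses]


responses = ["Energy is conserved in the system.", "I do not know."]
dataset = build_soft_labeled_dataset(responses, toy_teacher_logits)
for text, soft_label in dataset:
    print(text, [round(p, 3) for p in soft_label])
```

The student is then trained on `dataset` instead of, or alongside, the human-assigned scores, so the teacher has to be run only once per training response.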

The researchers also explore incorporating visual features, such as formatting and layout, into the distillation process. This DistilDoc approach can be particularly beneficial for scoring visually-rich documents, like essays or reports.

According to the abstract, the distilled student model reaches accuracy comparable to the teacher model and scores 3% and 2% higher than the ANN and TinyBERT baselines, respectively. The student has 0.03M parameters, which the abstract reports as 4,000 times smaller than the teacher, and its inference is about 10 times faster than TinyBERT's.
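The size figures quoted in the abstract (a 0.03M-parameter student, 4,000 times smaller than the teacher) can be sanity-checked with a few lines of arithmetic. The layer sizes below are purely hypothetical and are not taken from the paper; they only show what a network of roughly 0.03M parameters might look like. The implied teacher size is an inference from the abstract's two numbers, not a figure stated in the source.

```python
def dense_params(layer_sizes):
    """Count weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


STUDENT_PARAMS = 30_000   # 0.03M, as reported in the abstract
SIZE_RATIO = 4_000        # "4,000 times smaller", as reported in the abstract

# Teacher size implied by the two reported numbers.
implied_teacher_params = STUDENT_PARAMS * SIZE_RATIO
print(implied_teacher_params)  # 120000000

# Hypothetical student shape: 300 input features, two hidden layers, 4 score classes.
hypothetical_layers = [300, 96, 16, 4]
hypothetical_student = dense_params(hypothetical_layers)
print(hypothetical_student)  # 30516
```

An implied teacher of about 120M parameters is in the same range as a BERT-base-sized model, which is consistent with the summary's mention of BERT.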

The paper also delves into the sentence-level or token-level implications of the distillation process, providing insights into the trade-offs and performance characteristics at different granularities.

Critical Analysis

The paper presents a well-designed and thorough investigation into the use of knowledge distillation techniques to create efficient LLM-based models for automatic scoring in educational applications. The authors have carefully considered the practical challenges of deploying large, computationally intensive models and have proposed innovative solutions to address them.

One potential limitation of the research is the focus on a specific task (automatic scoring) and the use of a limited set of datasets: the 7T dataset of 6,684 student-written responses to science questions and three mathematical reasoning datasets. While the results are promising, it would be valuable to explore how well the distillation approaches generalize to a broader range of educational tasks and settings.

Additionally, the paper does not delve deeply into the potential biases or ethical considerations that may arise from the use of LLMs in educational assessment. As these models are trained on large, potentially biased datasets, it is essential to carefully evaluate their fairness and potential for perpetuating or exacerbating existing inequities.

Further research could also investigate the interpretability and transparency of the distilled models, as well as their ability to provide meaningful and actionable feedback to students and teachers. Exploring the integration of these models with other educational technologies, such as adaptive learning systems or personalized tutoring, could also be a fruitful area of exploration.

Conclusion

This paper presents an important contribution to the field of educational technology by demonstrating the potential of large language models for automatic scoring and the feasibility of using knowledge distillation to create more efficient and accessible solutions.

The researchers have shown that it is possible to stay close to the performance of an LLM teacher while drastically reducing the computational and cost barriers to deployment. This paves the way for the widespread adoption of LLM-based assessment tools in educational settings, potentially enabling more personalized, scalable, and data-driven feedback for students.

As the field of educational technology continues to evolve, the insights and techniques presented in this paper will likely have a significant impact on the development of next-generation assessment and learning tools, ultimately enhancing the educational experiences and outcomes for students.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MiniLLM: Knowledge Distillation of Large Language Models

Yuxian Gu, Li Dong, Furu Wei, Minlie Huang

Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable for different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/minillm.

4/11/2024

cs.CL cs.AI

Revisiting Knowledge Distillation for Autoregressive Language Models

Qihuang Zhong, Liang Ding, Li Shen, Juhua Liu, Bo Du, Dacheng Tao

Knowledge distillation (KD) is a common approach to compress a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, in the context of autoregressive language models (LMs), we empirically find that larger teacher LMs might dramatically result in a poorer student. In response to this problem, we conduct a series of analyses and reveal that different tokens have different teaching modes, neglecting which will lead to performance degradation. Motivated by this, we propose a simple yet effective adaptive teaching approach (ATKD) to improve the KD. The core of ATKD is to reduce rote learning and make teaching more diverse and flexible. Extensive experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains (up to +3.04% average score) across all model types and sizes. More encouragingly, ATKD can improve the student model generalization effectively.

6/18/2024

cs.CL

Improving Neural Topic Models with Wasserstein Knowledge Distillation

Suman Adhya, Debarshi Kumar Sanyal

Topic modeling is a dominant method for exploring document collections on the web and in digital libraries. Recent approaches to topic modeling use pretrained contextualized language models and variational autoencoders. However, large neural topic models have a considerable memory footprint. In this paper, we propose a knowledge distillation framework to compress a contextualized topic model without loss in topic quality. In particular, the proposed distillation objective is to minimize the cross-entropy of the soft labels produced by the teacher and the student models, as well as to minimize the squared 2-Wasserstein distance between the latent distributions learned by the two models. Experiments on two publicly available datasets show that the student trained with knowledge distillation achieves topic coherence much higher than that of the original student model, and even surpasses the teacher while containing far fewer parameters than the teacher's. The distilled model also outperforms several other competitive topic models on topic coherence.

6/21/2024

cs.CL cs.IR cs.LG

Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation

Yuhang Zhou, Jing Zhu, Paiheng Xu, Xiaoyu Liu, Xiyao Wang, Danai Koutra, Wei Ai, Furong Huang

Large language models (LLMs) have significantly advanced various natural language processing tasks, but deploying them remains computationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models. Particularly, sequence-level KD, which distills rationale-based reasoning processes instead of merely final outcomes, shows great potential in enhancing students' reasoning capabilities. However, current methods struggle with sequence level KD under long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting representative head domain examples and synthesizing tail domain examples, BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.

6/21/2024

cs.CL cs.AI