FreeKD: Knowledge Distillation via Semantic Frequency Prompt

Read original: arXiv:2311.12079 - Published 5/24/2024 by Yuan Zhang, Tao Huang, Jiaming Liu, Tao Jiang, Kuan Cheng, Shanghang Zhang

Overview

Knowledge distillation (KD) is a technique that aims to improve the performance of a smaller "student" model by learning from a larger "teacher" model.
Mainstream KD methods typically focus on spatial imitation, where the student model tries to mimic the spatial features of the teacher model.
However, the consecutive downsampling applied in the teacher model's spatial domain acts as a form of corruption, making it hard for the student to tell which information it should actually imitate.
The authors propose a new approach called FreeKD that shifts the focus to the frequency domain, allowing the student to better understand the underlying patterns in the teacher's feature maps.
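
As a point of reference for the spatial imitation that mainstream methods rely on, here is a minimal sketch of a feature-level imitation loss. It assumes the student and teacher feature maps already share the same shape and uses a plain mean squared error; the function name `spatial_imitation_loss` and the toy data are illustrative assumptions, not code from the paper.

```python
import numpy as np


def spatial_imitation_loss(student_feat, teacher_feat):
    """Mean squared error between two same-shaped feature maps, e.g. (C, H, W).

    Illustrative only: real pipelines usually add an adapter layer so that
    the student's channels match the teacher's before comparing.
    """
    student_feat = np.asarray(student_feat, dtype=np.float64)
    teacher_feat = np.asarray(teacher_feat, dtype=np.float64)
    if student_feat.shape != teacher_feat.shape:
        raise ValueError("feature maps must have the same shape")
    return float(np.mean((student_feat - teacher_feat) ** 2))


# Toy data: the "student" is a noisy copy of the "teacher".
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8, 8))
student = teacher + 0.1 * rng.normal(size=(4, 8, 8))
loss = spatial_imitation_loss(student, teacher)
```

Every pixel contributes equally to this loss, which is the behavior FreeKD moves away from by working on selected pixels in the frequency domain.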

Plain English Explanation

Knowledge distillation is a way to make a smaller, less powerful AI model perform better by learning from a larger, more capable model. Typically, this is done by having the smaller model try to mimic the spatial features (the arrangement and layout of information) in the larger model.

However, the process of downsampling, or reducing the size of the feature maps in the larger model, can introduce some "corruption" or noise that makes it hard for the smaller model to figure out exactly what information it should be trying to copy.

The researchers behind FreeKD decided to approach the problem differently. Instead of focusing on the spatial features, they looked at the frequency domain, which describes how the information in a feature map is distributed across frequency bands, from low to high.

They found that the low-frequency bands carry general but minimal context, while the high-frequency bands are more informative but also noisier, and that not every pixel within a band contributes equally to performance. So, they developed a way to locate the pixels of interest within these frequency bands and have the smaller model focus on imitating those specific areas.
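
To make the idea of frequency bands concrete, the sketch below splits a 2D feature map into one low-frequency band and three high-frequency bands with a single-level Haar wavelet transform. The choice of a Haar transform and the name `haar_dwt2` are assumptions made for illustration; this summary does not specify which frequency transform FreeKD uses.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an (H, W) array.

    H and W must be even. Returns four (H/2, W/2) bands:
    ll (local averages, low frequency) and lh, hl, hh (local differences,
    high frequency). Hypothetical helper for illustration.
    """
    x = np.asarray(x, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even height and width")
    # The four pixels of every non-overlapping 2x2 block.
    a = x[0::2, 0::2]  # top-left
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # scaled block average
    lh = (a - b + c - d) / 2.0  # left column minus right column
    hl = (a + b - c - d) / 2.0  # top row minus bottom row
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh


rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8))
ll, lh, hl, hh = haar_dwt2(feature_map)
```

A perfectly flat feature map ends up entirely in `ll`, while edges and noise show up in the three difference bands, which matches the intuition that low frequencies hold general context and high frequencies hold detail and noise.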

This FreeKD approach not only leads to better performance for the smaller model, but also makes it more robust. The researchers showed that it outperforms traditional spatial-based distillation methods on tasks like object detection and semantic segmentation.

Technical Explanation

The key innovations in the FreeKD approach are:

Frequency Prompt: The authors introduce a "Frequency Prompt" that is plugged into the teacher model during finetuning. This allows the teacher to absorb semantic context in the frequency domain.
Frequency Mask: During the distillation process, a pixel-wise frequency mask is generated using the Frequency Prompt. This mask helps the student model focus on the most important pixels within the different frequency bands.
Position-Aware Relational Frequency Loss: For dense prediction tasks like object detection and segmentation, the authors employ a position-aware relational frequency loss. This applies a high-order spatial enhancement to the student model, further improving its performance.
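
The role of the frequency mask can be sketched as a weighted distance between one teacher band and the matching student band. In the sketch below the mask is simply passed in as an array of weights in [0, 1]; in FreeKD it is generated from the Frequency Prompt, which is not reproduced here. The L1 distance, the normalization by the mask sum, and the name `masked_frequency_loss` are all assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np


def masked_frequency_loss(student_band, teacher_band, mask, eps=1e-8):
    """Mask-weighted mean absolute error between two frequency bands.

    All three arrays share one shape. `mask` holds per-pixel weights in
    [0, 1]; pixels with weight 0 are ignored. Hypothetical helper: the
    mask is hand-made here, not derived from a Frequency Prompt.
    """
    s = np.asarray(student_band, dtype=np.float64)
    t = np.asarray(teacher_band, dtype=np.float64)
    m = np.asarray(mask, dtype=np.float64)
    if not (s.shape == t.shape == m.shape):
        raise ValueError("bands and mask must have the same shape")
    # eps keeps the division defined when the mask is all zeros.
    return float(np.sum(m * np.abs(s - t)) / (np.sum(m) + eps))


teacher_band = np.zeros((2, 2))
student_band = np.array([[1.0, 0.0], [0.0, 4.0]])

# Uniform mask: every pixel counts, like plain imitation.
uniform = masked_frequency_loss(student_band, teacher_band, np.ones((2, 2)))

# Selective mask: only the bottom-right pixel is treated as a pixel of interest.
selective_mask = np.array([[0.0, 0.0], [0.0, 1.0]])
selective = masked_frequency_loss(student_band, teacher_band, selective_mask)
```

With the uniform mask the error is averaged over all four pixels, whereas the selective mask concentrates the loss on the single chosen pixel. That is the sense in which a mask lets the student focus on particular locations within a band.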

The authors evaluate FreeKD on dense prediction tasks, including object detection on COCO2017 and semantic segmentation on Cityscapes. They report that FreeKD consistently outperforms spatial-based distillation methods (e.g., a 3.8 AP gain for RepPoints-R50 on COCO2017 and a 4.55 mIoU gain for PSPNet-R18 on Cityscapes) and that it makes the student more robust.

The authors also validate the generalization of their approach by applying it to large-scale vision models like DINO and SAM.

Critical Analysis

The FreeKD approach addresses an important limitation of traditional spatial-based knowledge distillation methods, which can struggle with the corrupted feature maps introduced by the downsampling process in the teacher model.

By shifting the focus to the frequency domain, the authors are able to better understand and leverage the underlying patterns in the teacher's feature maps, leading to improved performance in the student model.

However, the paper does not provide a detailed analysis of the computational cost or memory requirements of the FreeKD approach. It would be helpful to understand the trade-offs in terms of efficiency compared to other knowledge distillation methods.

Additionally, the authors only evaluate FreeKD on dense prediction tasks like object detection and segmentation. It would be interesting to see how the method performs on other types of computer vision problems, such as image classification or super-resolution.

Overall, the FreeKD approach represents a promising step forward in knowledge distillation, and the authors have demonstrated its effectiveness on several important computer vision tasks.

Conclusion

The FreeKD method proposed in this paper offers a novel approach to knowledge distillation that shifts the focus from the spatial domain to the frequency domain. By identifying and leveraging the most informative pixels within different frequency bands, the student model is able to learn more effectively from the teacher, leading to significant performance gains on dense prediction tasks.

This research highlights the importance of looking beyond traditional spatial-based distillation techniques and exploring alternative ways to extract and transfer knowledge from large, powerful models to smaller, more efficient ones. As AI models continue to grow in size and complexity, techniques like FreeKD will become increasingly valuable for deploying high-performing AI systems on a wide range of devices and platforms.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FreeKD: Knowledge Distillation via Semantic Frequency Prompt

Yuan Zhang, Tao Huang, Jiaming Liu, Tao Jiang, Kuan Cheng, Shanghang Zhang

Knowledge distillation (KD) has been applied to various tasks successfully, and mainstream methods typically boost the student model via spatial imitation losses. However, the consecutive downsamplings induced in the spatial domain of teacher model is a type of corruption, hindering the student from analyzing what specific information needs to be imitated, which results in accuracy degradation. To better understand the underlying pattern of corrupted feature maps, we shift our attention to the frequency domain. During frequency distillation, we encounter a new challenge: the low-frequency bands convey general but minimal context, while the high are more informative but also introduce noise. Not each pixel within the frequency bands contributes equally to the performance. To address the above problem: (1) We propose the Frequency Prompt plugged into the teacher model, absorbing the semantic frequency context during finetuning. (2) During the distillation period, a pixel-wise frequency mask is generated via Frequency Prompt, to localize those pixel of interests (PoIs) in various frequency bands. Additionally, we employ a position-aware relational frequency loss for dense prediction tasks, delivering a high-order spatial enhancement to the student model. We dub our Frequency Knowledge Distillation method as FreeKD, which determines the optimal localization and extent for the frequency distillation. Extensive experiments demonstrate that FreeKD not only outperforms spatial-based distillation methods consistently on dense prediction tasks (e.g., FreeKD brings 3.8 AP gains for RepPoints-R50 on COCO2017 and 4.55 mIoU gains for PSPNet-R18 on Cityscapes), but also conveys more robustness to the student. Notably, we also validate the generalization of our approach on large-scale vision models (e.g., DINO and SAM).

5/24/2024

PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning

Gyeongman Kim, Doohyuk Jang, Eunho Yang

Recent advancements in large language models (LLMs) have raised concerns about inference costs, increasing the need for research into model compression. While knowledge distillation (KD) is a prominent method for this, research on KD for generative language models like LLMs is relatively sparse, and the approach of distilling student-friendly knowledge, which has shown promising performance in KD for classification models, remains unexplored in generative language models. To explore this approach, we propose PromptKD, a simple yet effective method that utilizes prompt tuning - for the first time in KD - to enable generative language models to transfer student-friendly knowledge. Unlike previous works in classification that require fine-tuning the entire teacher model for extracting student-friendly knowledge, PromptKD achieves similar effects by adding a small number of prompt tokens and tuning only the prompt with student guidance. Extensive experiments on instruction-following datasets show that PromptKD achieves state-of-the-art performance while adding only 0.0007% of the teacher's parameters as prompts. Further analysis suggests that distilling student-friendly knowledge alleviates exposure bias effectively throughout the entire training process, leading to performance enhancements.

6/26/2024

Pixel Distillation: A New Knowledge Distillation Scheme for Low-Resolution Image Recognition

Guangyu Guo, Dingwen Zhang, Longfei Han, Nian Liu, Ming-Ming Cheng, Junwei Han

Previous knowledge distillation (KD) methods mostly focus on compressing network architectures, which is not thorough enough in deployment as some costs like transmission bandwidth and imaging equipment are related to the image size. Therefore, we propose Pixel Distillation that extends knowledge distillation into the input level while simultaneously breaking architecture constraints. Such a scheme can achieve flexible cost control for deployment, as it allows the system to adjust both network architecture and image quality according to the overall requirement of resources. Specifically, we first propose an input spatial representation distillation (ISRD) mechanism to transfer spatial knowledge from large images to student's input module, which can facilitate stable knowledge transfer between CNN and ViT. Then, a Teacher-Assistant-Student (TAS) framework is further established to disentangle pixel distillation into the model compression stage and input compression stage, which significantly reduces the overall complexity of pixel distillation and the difficulty of distilling intermediate knowledge. Finally, we adapt pixel distillation to object detection via an aligned feature for preservation (AFP) strategy for TAS, which aligns output dimensions of detectors at each stage by manipulating features and anchors of the assistant. Comprehensive experiments on image classification and object detection demonstrate the effectiveness of our method. Code is available at https://github.com/gyguo/PixelDistillation.

7/11/2024

Frequency-mix Knowledge Distillation for Fake Speech Detection

Cunhang Fan, Shunbo Dong, Jun Xue, Yujie Chen, Jiangyan Yi, Zhao Lv

In the telephony scenarios, the fake speech detection (FSD) task to combat speech spoofing attacks is challenging. Data augmentation (DA) methods are considered effective means to address the FSD task in telephony scenarios, typically divided into time domain and frequency domain stages. While each has its advantages, both can result in information loss. To tackle this issue, we propose a novel DA method, Frequency-mix (Freqmix), and introduce the Freqmix knowledge distillation (FKD) to enhance model information extraction and generalization abilities. Specifically, we use Freqmix-enhanced data as input for the teacher model, while the student model's input undergoes time-domain DA method. We use a multi-level feature distillation approach to restore information and improve the model's generalization capabilities. Our approach achieves state-of-the-art results on ASVspoof 2021 LA dataset, showing a 31% improvement over baseline and performs competitively on ASVspoof 2021 DF dataset.

6/17/2024