Physics Inspired Criterion for Pruning-Quantization Joint Learning

Read original: arXiv:2312.00851 - Published 6/5/2024 by Weiying Xie, Xiaoyi Fan, Xin Zhang, Yunsong Li, Jie Lei, Leyuan Fang


Overview

This paper proposes a novel physics-inspired criterion for joint learning of model pruning and quantization, called PIC-PQ.
The method draws an analogy between elasticity dynamics and model compression, establishing a linear relationship between filter importance and filter properties.
PIC-PQ aims to achieve a good trade-off between model accuracy and bit-operations (BOPs) compression ratio for deploying deep neural networks (DNNs) on resource-constrained edge devices.

Plain English Explanation

Deploying deep neural networks on edge devices, such as smartphones or IoT sensors, can be challenging due to the limited computing resources and memory available on these devices. To address this, researchers often use techniques like pruning to remove unnecessary parts of the neural network and quantization to reduce the precision of the network's parameters.
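As a rough illustration of the two techniques in isolation, the sketch below does two things. It prunes whole filters by their L2 norm, then applies uniform symmetric quantization to the surviving weights. The helper names (`prune_filters`, `quantize`) and the norm-based ranking are hypothetical choices for illustration. This is not the PIC-PQ criterion.

```python
# Illustrative only: magnitude-based filter pruning plus uniform symmetric
# quantization. prune_filters/quantize are hypothetical helpers, not PIC-PQ.


def l2_norm(filt):
    """L2 norm of a filter stored as a flat list of weights."""
    return sum(w * w for w in filt) ** 0.5


def prune_filters(filters, keep):
    """Keep the `keep` filters with the largest L2 norm, in original order."""
    ranked = sorted(range(len(filters)), key=lambda i: l2_norm(filters[i]), reverse=True)
    return [filters[i] for i in sorted(ranked[:keep])]


def quantize(weights, bits):
    """Round weights onto a symmetric grid with 2**(bits-1) - 1 positive levels.

    Returns the de-quantized values so they stay comparable to the originals.
    """
    levels = 2 ** (bits - 1) - 1          # 7 positive levels for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / levels              # step size of the quantization grid
    return [round(w / scale) * scale for w in weights]


filters = [[0.9, -0.4, 0.3], [0.01, 0.02, -0.01], [0.5, 0.5, -0.6]]
pruned = prune_filters(filters, keep=2)           # drops the near-zero filter
quantized = [quantize(f, bits=4) for f in pruned]
print(pruned)
print(quantized)
```

Pruning shrinks how many operations run, while quantization shrinks how many bits each operation uses. Joint methods such as PIC-PQ decide both at once instead of applying them one after the other.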

In this paper, the authors propose a new approach called PIC-PQ that combines pruning and quantization in a single joint learning process. They draw an analogy between elasticity dynamics, the physics of how materials deform under force, and the process of compressing a neural network. This lets them tie the importance of each part of the network (its "filters") to a measurable filter property through a scale factor that is learned during training.

The key idea is to prune and quantize the network in a way that balances high accuracy against a large reduction in the number of bit-operations (BOPs) needed to run it. BOPs count the network's arithmetic weighted by the bitwidths involved, so they capture the savings from both pruning and quantization.
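A minimal sketch of BOPs accounting for a convolution layer follows. It assumes the common convention BOPs = MACs × weight bitwidth × activation bitwidth, with a 32-bit baseline. The paper's exact accounting may differ. The `ConvLayer` class and the layer sizes are hypothetical.

```python
# Sketch of BOPs accounting, assuming BOPs = MACs * w_bits * a_bits and a
# 32-bit baseline; PIC-PQ's exact formula may differ. ConvLayer is hypothetical.
from dataclasses import dataclass


@dataclass
class ConvLayer:
    c_in: int          # input channels (after pruning the previous layer)
    c_out: int         # output channels, i.e. filters kept in this layer
    k: int             # square kernel size
    h_out: int         # output feature-map height
    w_out: int         # output feature-map width
    w_bits: int = 32   # weight bitwidth
    a_bits: int = 32   # activation bitwidth

    def macs(self) -> int:
        return self.c_in * self.c_out * self.k * self.k * self.h_out * self.w_out

    def bops(self) -> int:
        return self.macs() * self.w_bits * self.a_bits


def bops_compression_ratio(baseline, compressed):
    """Total baseline BOPs divided by total compressed BOPs."""
    return sum(l.bops() for l in baseline) / sum(l.bops() for l in compressed)


# Hypothetical 3x3 conv, 64 -> 64 channels on a 32x32 map, full precision.
base = [ConvLayer(64, 64, 3, 32, 32)]
# Prune half of the filters and quantize weights and activations to 4 bits.
comp = [ConvLayer(64, 32, 3, 32, 32, w_bits=4, a_bits=4)]
ratio = bops_compression_ratio(base, comp)
print(ratio)  # 128.0
```

Pruning contributes a factor of 2 here and 4-bit quantization contributes 32 × 32 / (4 × 4) = 64, giving 128. In a real network, pruning a layer's filters also reduces `c_in` of the next layer, so the savings compound.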

Technical Explanation

The authors propose a novel physics-inspired criterion for pruning-quantization joint learning (PIC-PQ), derived from an analogy between elasticity dynamics (ED) and model compression (MC). Specifically, borrowing from Hooke's law in ED, they establish a linear relationship between the filters' importance distribution and the filter property (FP) through a learnable deformation scale in the physics-inspired criterion (PIC).
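By analogy with Hooke's law (F = k·x), the criterion can be pictured as importance = a_l · FP, where a_l is a per-layer scale. The sketch below assumes FP is a filter's L2 norm, which is a hypothetical choice, and fixes the scales by hand. In PIC-PQ the scale is learned, and the paper's definition of FP may differ.

```python
# Hypothetical sketch of a Hooke's-law-style criterion: importance = a_l * FP.
# Assumes FP is a filter's L2 norm and a_l is a per-layer scale; in PIC-PQ the
# scale is learned, here it is simply fixed to show its effect.


def l2_norm(filt):
    return sum(w * w for w in filt) ** 0.5


def importance(filters, scale):
    """Importance of each filter in one layer, analogous to F = k * x."""
    return [scale * l2_norm(f) for f in filters]


def global_ranking(layers):
    """layers: dict name -> (filters, scale). Returns (name, index), most important first."""
    scored = []
    for name, (filters, scale) in layers.items():
        for i, score in enumerate(importance(filters, scale)):
            scored.append((score, name, i))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(name, i) for _, name, i in scored]


conv1 = [[3.0, 4.0], [1.0, 0.0]]   # FP = 5.0, 1.0
conv2 = [[2.0, 0.0], [0.0, 0.5]]   # FP = 2.0, 0.5
equal = global_ranking({"conv1": (conv1, 1.0), "conv2": (conv2, 1.0)})
scaled = global_ranking({"conv1": (conv1, 1.0), "conv2": (conv2, 3.0)})
print(equal)   # [('conv1', 0), ('conv2', 0), ('conv1', 1), ('conv2', 1)]
print(scaled)  # [('conv2', 0), ('conv1', 0), ('conv2', 1), ('conv1', 1)]
```

A positive per-layer scale never reorders filters within a layer. It does change how layers compete in a single cross-layer ranking, which is why learning that scale matters.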

Furthermore, the authors extend PIC with a relative shift variable to provide a global view. To ensure feasibility and flexibility, they introduce an available maximum bitwidth and a penalty factor into the quantization bitwidth assignment.
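The sketch below adds a relative shift, giving importance = a_l · FP + b_l, and then applies a bitwidth rule. The rule is capped at an available maximum bitwidth `b_max` and lowered by a `penalty` term. It is an invented toy for illustration and is not the assignment formula from the paper. All names and values are hypothetical.

```python
# Hypothetical sketch: importance with a relative shift, I = a_l * FP + b_l,
# and a toy bitwidth rule capped at an available maximum bitwidth b_max and
# lowered by a penalty factor. The rule itself is an assumption, not PIC-PQ's.


def shifted_importance(fps, scale, shift):
    """Affine importance for one layer's filter properties."""
    return [scale * fp + shift for fp in fps]


def assign_bitwidths(layers, b_max=8, b_min=2, penalty=1.0):
    """layers: dict name -> (fps, scale, shift). Returns dict name -> bitwidth."""
    means = {}
    for name, (fps, scale, shift) in layers.items():
        scores = shifted_importance(fps, scale, shift)
        means[name] = sum(scores) / len(scores)
    top = max(means.values())
    if top <= 0:
        raise ValueError("toy rule assumes positive importance")
    bits = {}
    for name, mean in means.items():
        raw = b_max * mean / top - penalty       # penalty trims every layer
        bits[name] = int(min(b_max, max(b_min, round(raw))))
    return bits


layers = {
    "conv1": ([1.0, 3.0], 1.0, 0.0),   # mean importance 2.0
    "conv2": ([1.0, 1.0], 1.0, 0.0),   # mean importance 1.0
    "conv3": ([1.0, 1.0], 1.0, 1.0),   # same FPs as conv2, shifted by +1
}
strict = assign_bitwidths(layers)
loose = assign_bitwidths(layers, penalty=0.0)
print(strict)  # {'conv1': 7, 'conv2': 3, 'conv3': 7}
print(loose)   # {'conv1': 8, 'conv2': 4, 'conv3': 8}
```

Layers `conv2` and `conv3` have identical filter properties, yet the shift moves `conv3` up in the global ordering. The cap keeps every assignment within what the hardware supports.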

Experiments on image classification benchmarks, such as CIFAR-10 and ImageNet, demonstrate that PIC-PQ can achieve a good trade-off between accuracy and BOPs compression ratio. For example, PIC-PQ achieves a 54.96X BOPs compression ratio in ResNet56 on CIFAR-10 with a 0.10% accuracy drop, and a 53.24X compression ratio in ResNet18 on ImageNet with a 0.61% accuracy drop.
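To put the 54.96X figure in perspective, consider a simplified case. It assumes BOPs = MACs × b_w × b_a with a 32-bit weight and activation baseline, which is a common convention that the paper's accounting may not match exactly. Under that assumption, if the entire ratio came from quantization alone with no pruning, it would correspond to uniform bitwidths of roughly 4.3 bits:

```latex
\text{BOPs ratio}
  = \frac{\mathrm{MACs}_{\text{full}} \cdot 32 \cdot 32}
         {\mathrm{MACs}_{\text{comp}} \cdot b_w \cdot b_a},
\qquad
\mathrm{MACs}_{\text{comp}} = \mathrm{MACs}_{\text{full}}
  \;\Rightarrow\;
b_w \, b_a = \frac{1024}{54.96} \approx 18.63,
\quad b_w = b_a \approx 4.32
```

PIC-PQ also prunes filters, which lowers MACs, so the bitwidths it actually assigns can sit above this figure for the same ratio.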

Critical Analysis

The authors provide a novel and interesting approach to jointly learning pruning and quantization for deep neural networks. The physics-inspired analogy used to establish the relationship between filter importance and properties is a unique perspective that could lead to further insights in model compression research.

However, the paper does not provide much discussion on the limitations of the PIC-PQ method. It would be helpful to understand the scenarios where the approach may not work as well, such as potential issues with certain network architectures or tasks. Additionally, the authors do not address the computational overhead of the joint learning process, which could be a concern for real-world deployment.

Further research could explore ways to make the PIC-PQ method more efficient or generalizable, as well as investigate the broader implications of using physics-inspired principles in model compression techniques.

Conclusion

This paper presents a novel physics-inspired criterion for pruning-quantization joint learning (PIC-PQ), which aims to achieve a good balance between model accuracy and bit-operations compression ratio for deploying deep neural networks on resource-constrained edge devices. The key innovation is the use of an analogy between elasticity dynamics and model compression to establish a linear relationship between filter importance and filter properties, which is then leveraged in the joint learning process.

The experimental results show large BOPs compression ratios for ResNet56 on CIFAR-10 and ResNet18 on ImageNet, with accuracy drops below 1%. They suggest that PIC-PQ can substantially compress neural networks while largely preserving accuracy. This could make deep learning models easier to deploy on edge devices with limited computing resources and memory.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Physics Inspired Criterion for Pruning-Quantization Joint Learning

Weiying Xie, Xiaoyi Fan, Xin Zhang, Yunsong Li, Jie Lei, Leyuan Fang

Pruning-quantization joint learning always facilitates the deployment of deep neural networks (DNNs) on resource-constrained edge devices. However, most existing methods do not jointly learn a global criterion for pruning and quantization in an interpretable way. In this paper, we propose a novel physics inspired criterion for pruning-quantization joint learning (PIC-PQ), which is explored from an analogy we first draw between elasticity dynamics (ED) and model compression (MC). Specifically, derived from Hooke's law in ED, we establish a linear relationship between the filters' importance distribution and the filter property (FP) by a learnable deformation scale in the physics inspired criterion (PIC). Furthermore, we extend PIC with a relative shift variable for a global view. To ensure feasibility and flexibility, available maximum bitwidth and penalty factor are introduced in quantization bitwidth assignment. Experiments on benchmarks of image classification demonstrate that PIC-PQ yields a good trade-off between accuracy and bit-operations (BOPs) compression ratio (e.g., 54.96X BOPs compression ratio in ResNet56 on CIFAR10 with 0.10% accuracy drop and 53.24X in ResNet18 on ImageNet with 0.61% accuracy drop). The code will be available at https://github.com/fanxxxxyi/PIC-PQ.

6/5/2024

Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks

Beatrice Alessandra Motetti, Matteo Risso, Alessio Burrello, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari

The resource requirements of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices. Common approaches to address this issue are pruning and mixed-precision quantization, which lead to latency and memory occupation improvements. These optimization techniques are usually applied independently. We propose a novel methodology to apply them jointly via a lightweight gradient-based search, and in a hardware-aware manner, greatly reducing the time required to generate Pareto-optimal DNNs in terms of accuracy versus cost (i.e., latency or memory). We test our approach on three edge-relevant benchmarks, namely CIFAR-10, Google Speech Commands, and Tiny ImageNet. When targeting the optimization of the memory footprint, we are able to achieve a size reduction of 47.50% and 69.54% at iso-accuracy with the baseline networks with all weights quantized at 8 and 2-bit, respectively. Our method surpasses a previous state-of-the-art approach with up to 56.17% size reduction at iso-accuracy. With respect to the sequential application of state-of-the-art pruning and mixed-precision optimizations, we obtain comparable or superior results, but with a significantly lowered training time. In addition, we show how well-tailored cost models can improve the cost versus accuracy trade-offs when targeting specific hardware for deployment.

7/2/2024

Retraining-free Model Quantization via One-Shot Weight-Coupling Learning

Chen Tang, Yuan Meng, Jiacheng Jiang, Shuzhao Xie, Rongwei Lu, Xinzhu Ma, Zhi Wang, Wenwu Zhu

Quantization is of significance for compressing the over-parameterized deep neural models and deploying them on resource-limited devices. Fixed-precision quantization suffers from performance drop due to the limited numerical representation ability. Conversely, mixed-precision quantization (MPQ) is advocated to compress the model effectively by allocating heterogeneous bit-width for layers. MPQ is typically organized into a searching-retraining two-stage process. In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression. Specifically, in the first stage, all potential bit-width configurations are coupled and thus optimized simultaneously within a set of shared weights. However, our observations reveal a previously unseen and severe bit-width interference phenomenon among highly coupled weights during optimization, leading to considerable performance degradation under a high compression ratio. To tackle this problem, we first design a bit-width scheduler to dynamically freeze the most turbulent bit-width of layers during training, to ensure the rest bit-widths converged properly. Then, taking inspiration from information theory, we present an information distortion mitigation technique to align the behavior of the bad-performing bit-widths to the well-performing ones. In the second stage, an inference-only greedy search scheme is devised to evaluate the goodness of configurations without introducing any additional training costs. Extensive experiments on three representative models and three datasets demonstrate the effectiveness of the proposed method. Code is available at https://github.com/1hunters/retraining-free-quantization.

6/17/2024

DeepHQ: Learned Hierarchical Quantizer for Progressive Deep Image Coding

Jooyoung Lee, Se Yoon Jeong, Munchurl Kim

Unlike fixed- or variable-rate image coding, progressive image coding (PIC) aims to compress various qualities of images into a single bitstream, increasing the versatility of bitstream utilization and providing high compression efficiency compared to simulcast compression. Research on neural network (NN)-based PIC is in its early stages, mainly focusing on applying varying quantization step sizes to the transformed latent representations in a hierarchical manner. These approaches are designed to compress only the progressively added information as the quality improves, considering that a wider quantization interval for lower-quality compression includes multiple narrower sub-intervals for higher-quality compression. However, the existing methods are based on handcrafted quantization hierarchies, resulting in sub-optimal compression efficiency. In this paper, we propose an NN-based progressive coding method that firstly utilizes learned quantization step sizes via learning for each quantization layer. We also incorporate selective compression with which only the essential representation components are compressed for each quantization layer. We demonstrate that our method achieves significantly higher coding efficiency than the existing approaches with decreased decoding time and reduced model size.

8/23/2024