Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks

2407.01054

Published 7/2/2024 by Beatrice Alessandra Motetti, Matteo Risso, Alessio Burrello, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari

cs.LG

Abstract

The resource requirements of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices. Common approaches to address this issue are pruning and mixed-precision quantization, which lead to latency and memory occupation improvements. These optimization techniques are usually applied independently. We propose a novel methodology to apply them jointly via a lightweight gradient-based search, and in a hardware-aware manner, greatly reducing the time required to generate Pareto-optimal DNNs in terms of accuracy versus cost (i.e., latency or memory). We test our approach on three edge-relevant benchmarks, namely CIFAR-10, Google Speech Commands, and Tiny ImageNet. When targeting the optimization of the memory footprint, we are able to achieve a size reduction of 47.50% and 69.54% at iso-accuracy with the baseline networks with all weights quantized at 8 and 2-bit, respectively. Our method surpasses a previous state-of-the-art approach with up to 56.17% size reduction at iso-accuracy. With respect to the sequential application of state-of-the-art pruning and mixed-precision optimizations, we obtain comparable or superior results, but with a significantly lowered training time. In addition, we show how well-tailored cost models can improve the cost versus accuracy trade-offs when targeting specific hardware for deployment.

Overview

This research paper presents a novel approach to efficiently deploying deep neural networks on edge devices by jointly applying two techniques: pruning and channel-wise mixed-precision quantization.
Pruning selectively removes unnecessary network connections to reduce model size, while quantization compresses the model by reducing the precision of weights and activations.
The authors propose a joint optimization strategy that leverages the synergies between these two methods to achieve even greater efficiency gains.

Plain English Explanation

The paper tackles the challenge of running complex deep learning models on resource-constrained edge devices like smartphones or IoT sensors. These devices often have limited processing power and memory, making it difficult to deploy large, state-of-the-art neural networks.

To address this, the researchers use two complementary techniques: pruning and quantization.

Pruning involves selectively removing unnecessary connections in the neural network, kind of like trimming the branches of a tree to make it more compact. This helps reduce the overall model size and computational requirements.
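As a rough illustration of the idea, the sketch below prunes whole output channels by their L1 norm. The magnitude criterion, the `prune_channels` name, and the list-of-lists weight layout are assumptions made for this example; the paper learns what to prune through its gradient-based search rather than using a fixed rule like this one.

```python
def prune_channels(weights, keep_ratio):
    """Keep the output channels with the largest L1 norm.

    weights: list of channels, each a list of floats (hypothetical layout).
    keep_ratio: fraction of channels to keep, in (0, 1].
    Returns the sorted indices of the kept channels and the pruned weights.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Score each channel by the sum of absolute weight values.
    norms = [sum(abs(w) for w in channel) for channel in weights]
    n_keep = max(1, round(keep_ratio * len(weights)))
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, [weights[i] for i in kept]


layer = [
    [0.9, -0.8, 0.7],      # large weights, likely important
    [0.01, 0.02, -0.01],   # near-zero channel
    [0.5, -0.4, 0.3],
    [0.0, 0.05, 0.0],      # near-zero channel
]
kept, pruned = prune_channels(layer, keep_ratio=0.5)
print(kept)  # [0, 2]
```

Removing whole channels, rather than individual weights, keeps the remaining tensors dense, which is generally easier for edge hardware to exploit.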

Quantization compresses the model by reducing the precision of the numerical values (weights and activations) used in the network. For example, instead of using 32-bit floating-point numbers, you might use 8-bit integers, which cuts the storage for those values by a factor of four. This can substantially reduce memory usage and inference time, often with only a small loss in accuracy.
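To make the 32-bit to 8-bit example concrete, here is a minimal sketch of uniform symmetric quantization. The symmetric scheme, the single scale per tensor, and the function names are assumptions for illustration, not details taken from the paper.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers on a uniform grid with `bits` bits."""
    if bits < 2:
        raise ValueError("bits must be at least 2 for a signed grid")
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest grid point and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [x * scale for x in q]


weights = [0.50, -1.27, 0.03, 1.00]
q8, s8 = quantize_symmetric(weights, bits=8)
restored = dequantize(q8, s8)

fp32_bits = 32 * len(weights)   # storage as 32-bit floats
int8_bits = 8 * len(weights)    # storage as 8-bit integers
print(q8, fp32_bits // int8_bits)
```

The rounding error per value is at most half a quantization step (`s8 / 2`), which is why lower bit-widths, with their coarser steps, cost more accuracy.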

The key innovation in this paper is that the authors jointly optimize the pruning and quantization processes. By considering them together, they can find the most efficient combination of pruning and quantization for a given model and hardware setup. This allows them to push the boundaries of model compression and achieve even greater efficiency gains compared to using the techniques separately.

Technical Explanation

According to the abstract, the paper does not apply pruning and quantization as separate stages, which is how these optimizations are usually used. Instead, it applies them jointly through a lightweight gradient-based search that is also hardware-aware: the search trades accuracy against a cost, which can be either latency or memory. Pruning removes parts of the network that contribute little, reducing the model's overall size.

The quantization is channel-wise and mixed-precision. This allows different channels within the same layer to be quantized to different precisions, offering more flexibility than uniform quantization, where a single bit-width is shared by all weights. The abstract gives no further detail on the training procedure beyond the gradient-based search.
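The flexibility of channel-wise mixed precision can be seen in a small sketch: each channel gets its own bit-width and its own scale, and the layer's storage is the sum over channels. The bit-width assignment, the weight values, and the helper names below are hypothetical and chosen only for illustration.

```python
def quantize_channel(channel, bits):
    """Symmetric uniform quantization of one channel with its own scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in channel)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in channel]
    return codes, scale


def layer_size_bits(weights, bits_per_channel):
    """Weight storage of a layer when each channel has its own precision."""
    return sum(len(ch) * b for ch, b in zip(weights, bits_per_channel))


layer = [
    [0.9, -0.8, 0.7, 0.1],
    [0.05, 0.02, -0.04, 0.01],
    [0.5, -0.4, 0.3, -0.2],
]
bits_per_channel = [8, 2, 4]   # hypothetical assignment, one entry per channel

quantized = [quantize_channel(ch, b) for ch, b in zip(layer, bits_per_channel)]
mixed_bits = layer_size_bits(layer, bits_per_channel)
uniform_bits = layer_size_bits(layer, [8] * len(layer))
print(mixed_bits, uniform_bits)  # 56 96
```

In this toy layer the mixed assignment needs 56 bits instead of 96 for uniform 8-bit storage, ignoring the small overhead of storing per-channel scales and bit-widths.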

The key insight is that by jointly optimizing the pruning and quantization steps, the method can find the most efficient combination of the two techniques for a given model and target hardware. This leverages the synergies between pruning and quantization to achieve even greater model compression and efficiency gains compared to using them separately.
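One way to picture a joint, gradient-based search is to give every channel a set of learnable logits over candidate bit-widths and to minimize a differentiable objective that combines an accuracy term with a cost term. The sketch below is a toy version under strong assumptions that are not taken from the paper: 0 bits is used to stand for a pruned channel, the accuracy term is replaced by a made-up per-channel `sensitivity` multiplied by `2 ** -bits`, and the cost is weight memory in bits. In a real setting the accuracy term would come from training the network itself.

```python
import math

# Candidate precisions; 0 bits stands for "channel pruned" (assumption).
BITS = [0, 2, 4, 8]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def option_costs(sensitivity, n_weights, lam):
    # Hypothetical objective per option: accuracy proxy + lam * memory in bits.
    return [sensitivity * 2.0 ** (-b) + lam * n_weights * b for b in BITS]


def search(sensitivities, n_weights, lam, steps=200, lr=0.1):
    """Gradient descent on per-channel logits over the candidate bit-widths."""
    logits = [[0.0] * len(BITS) for _ in sensitivities]
    for _ in range(steps):
        for c, s in enumerate(sensitivities):
            v = option_costs(s, n_weights, lam)
            p = softmax(logits[c])
            expected = sum(pk * vk for pk, vk in zip(p, v))
            # d(expected) / d(logit_k) = p_k * (v_k - expected)
            for k in range(len(BITS)):
                logits[c][k] -= lr * p[k] * (v[k] - expected)
    # After the search, each channel takes its most probable option.
    return [BITS[max(range(len(BITS)), key=lambda k: row[k])] for row in logits]


chosen = search(sensitivities=[100.0, 0.01, 1.0], n_weights=4, lam=0.05)
print(chosen)  # [8, 0, 2]
```

The sensitive channel keeps 8 bits, the insensitive one is pruned, and the middle one lands on 2 bits, all decided in a single optimization rather than by a pruning pass followed by a quantization pass. Raising `lam` pushes every channel toward cheaper options.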

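The authors evaluate the approach on three edge-relevant benchmarks: CIFAR-10, Google Speech Commands, and Tiny ImageNet. When optimizing for memory footprint, they report size reductions of 47.50% and 69.54% at iso-accuracy relative to baseline networks with all weights quantized at 8 bits and 2 bits, respectively, and up to 56.17% size reduction at iso-accuracy over a previous state-of-the-art approach. Compared with applying state-of-the-art pruning and mixed-precision methods sequentially, the joint search gives comparable or superior results with significantly less training time.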
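The abstract frames results as Pareto-optimal networks in terms of accuracy versus cost, and as size reductions at iso-accuracy. The sketch below shows how both notions can be computed. The candidate points are invented for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Return the (cost, accuracy) points that no other point dominates.

    A point dominates another if it is no worse in both cost and accuracy
    and strictly better in at least one of them.
    """
    front = []
    for cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for c, a in points
        )
        if not dominated:
            front.append((cost, acc))
    return sorted(front)


def size_reduction_percent(baseline_size, compressed_size):
    """Relative size reduction of a compressed model, in percent."""
    return (baseline_size - compressed_size) / baseline_size * 100


# Hypothetical (size in kB, accuracy) pairs, made up for this example.
candidates = [(100, 0.90), (60, 0.90), (60, 0.85), (30, 0.80), (40, 0.78)]
front = pareto_front(candidates)
print(front)  # [(30, 0.8), (60, 0.9)]

# Iso-accuracy comparison: same accuracy (0.90), smaller size.
print(size_reduction_percent(100, 60))
```

Read this way, a reported 47.50% reduction at iso-accuracy means the compressed network needs 52.5% of the baseline's memory while matching its accuracy.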

Critical Analysis

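The paper evaluates the joint pruning and quantization method on three edge-relevant benchmarks spanning image classification (CIFAR-10, Tiny ImageNet) and keyword spotting (Google Speech Commands), and compares it against uniform-precision baselines, a previous state-of-the-art approach, and the sequential application of pruning and mixed-precision quantization. This breadth gives some confidence in the generalizability of the findings, though all three are edge-scale tasks.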

However, some questions remain open. The abstract reports a lower training time than sequential pruning and quantization, but the effect of the joint search on convergence is not covered in this summary. The reported costs are latency and memory; energy consumption is not mentioned in the abstract, and a deeper look at how accuracy, latency, and memory trade off on specific devices would be valuable for practitioners.

Further research could investigate the performance of the joint pruning and quantization method on a broader range of hardware platforms, including edge devices with heterogeneous computing resources. Exploring the synergies between these techniques and other model compression approaches, such as knowledge distillation, could also lead to even more efficient neural network deployments.

Conclusion

This research paper presents a novel approach to efficiently deploying deep neural networks on resource-constrained edge devices. By jointly optimizing pruning and channel-wise mixed-precision quantization, the authors achieve significant reductions in memory footprint or latency while maintaining accuracy.

The key innovation is the ability to leverage the synergies between these two complementary techniques, leading to greater efficiency gains compared to using them separately. This work represents an important step forward in enabling the deployment of advanced deep learning models on a wide range of edge computing platforms, with potential applications in areas such as mobile computing, IoT, and embedded systems.

Related Papers

AMED: Automatic Mixed-Precision Quantization for Edge Devices

Moshe Kimhi, Tal Rozen, Avi Mendelson, Chaim Baskin

Quantized neural networks are well known for reducing the latency, power consumption, and model size without significant harm to the performance. This makes them highly appropriate for systems with limited resources and low power capacity. Mixed-precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths. Quantization methods either aim to minimize the compression loss given a desired reduction or optimize a dependent variable for a specified property of the model (such as FLOPs or model size); both make the performance inefficient when deployed on specific hardware, but more importantly, quantization methods assume that the loss manifold holds a global minimum for a quantized model that copes with the global minimum of the full precision counterpart. Challenging this assumption, we argue that the optimal minimum changes as the precision changes, and thus, it is better to look at quantization as a random process, placing the foundation for a different approach to quantize neural networks, which, during the training procedure, quantizes the model to a different precision, looks at the bit allocation as a Markov Decision Process, and then, finds an optimal bitwidth allocation for measuring specified behaviors on a specific device via direct signals from the particular hardware architecture. By doing so, we avoid the basic assumption that the loss behaves the same way for a quantized model. Automatic Mixed-Precision Quantization for Edge Devices (dubbed AMED) demonstrates its superiority over current state-of-the-art schemes in terms of the trade-off between neural network accuracy and hardware efficiency, backed by a comprehensive evaluation.

6/11/2024

cs.LG

Multi-Dimensional Pruning: Joint Channel, Layer and Block Pruning with Latency Constraint

Xinglong Sun, Barath Lakshmanan, Maying Shen, Shiyi Lan, Jingde Chen, Jose Alvarez

As we push the boundaries of performance in various vision tasks, the models grow in size correspondingly. To keep up with this growth, we need very aggressive pruning techniques for efficient inference and deployment on edge devices. Existing pruning approaches are limited to channel pruning and struggle with aggressive parameter reductions. In this paper, we propose a novel multi-dimensional pruning framework that jointly optimizes pruning across channels, layers, and blocks while adhering to latency constraints. We develop a latency modeling technique that accurately captures model-wide latency variations during pruning, which is crucial for achieving optimal latency-accuracy trade-offs at high pruning ratios. We reformulate pruning as a Mixed-Integer Nonlinear Program (MINLP) to efficiently determine the optimal pruned structure with only a single pass. Our extensive results demonstrate substantial improvements over previous methods, particularly at large pruning ratios. In classification, our method significantly outperforms prior art HALP with a Top-1 accuracy of 70.0 (vs. 68.6) and an FPS of 5262 im/s (vs. 4101 im/s). In 3D object detection, we establish a new state-of-the-art by pruning StreamPETR at a 45% pruning ratio, achieving higher FPS (37.3 vs. 31.7) and mAP (0.451 vs. 0.449) than the dense baseline.

6/19/2024

cs.CV cs.AI cs.LG

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

Wei Huang, Haotong Qin, Yangdong Liu, Jingzhuo Liang, Yulun Zhang, Ying Li, Xianglong Liu

Low-bit quantization emerges as one of the most promising compression approaches for deploying deep neural networks on edge devices. Mixed-precision quantization leverages a mixture of bit-widths to unleash the accuracy and efficiency potential of quantized models. However, existing mixed-precision quantization methods rely on simulations in high-performance devices to achieve accuracy and efficiency trade-offs in immense search spaces. This leads to a non-negligible gap between the estimated efficiency metrics and the actual hardware that makes quantized models far away from the optimal accuracy and efficiency, and also causes the quantization process to rely on additional high-performance devices. In this paper, we propose an On-Chip Hardware-Aware Quantization (OHQ) framework, performing hardware-aware mixed-precision quantization on deployed edge devices to achieve accurate and efficient computing. Specifically, for efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator and avoid optimization errors caused by inaccurate simulation. For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario, getting rid of the dependence of the quantization process on high computing power. By synthesizing insights from quantized models and hardware through linear optimization, we can obtain optimized bit-width configurations to achieve outstanding performance on accuracy and efficiency. We evaluate inference accuracy and acceleration with quantization for various architectures and compression ratios on hardware. OHQ achieves 70% and 73% accuracy for ResNet-18 and MobileNetV3, respectively, and can reduce latency by 15~30% compared to INT8 on real deployment.

5/24/2024

cs.LG cs.AI cs.AR

Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators

Jan Klhufek, Miroslav Safar, Vojtech Mrazek, Zdenek Vasicek, Lukas Sekanina

Energy efficiency and memory footprint of a convolutional neural network (CNN) implemented on a CNN inference accelerator depend on many factors, including a weight quantization strategy (i.e., data types and bit-widths) and mapping (i.e., placement and scheduling of DNN elementary operations on hardware units of the accelerator). We show that enabling rich mixed quantization schemes during the implementation can open a previously hidden space of mappings that utilize the hardware resources more effectively. CNNs utilizing quantized weights and activations and suitable mappings can significantly improve trade-offs among the accuracy, energy, and memory requirements compared to less carefully optimized CNN implementations. To find, analyze, and exploit these mappings, we: (i) extend a general-purpose state-of-the-art mapping tool (Timeloop) to support mixed quantization, which is not currently available; (ii) propose an efficient multi-objective optimization algorithm to find the most suitable bit-widths and mapping for each DNN layer executed on the accelerator; and (iii) conduct a detailed experimental evaluation to validate the proposed method. On two CNNs (MobileNetV1 and MobileNetV2) and two accelerators (Eyeriss and Simba) we show that for a given quality metric (such as the accuracy on ImageNet), energy savings are up to 37% without any accuracy drop.

4/9/2024

cs.AR cs.LG