Separate, Dynamic and Differentiable (SMART) Pruner for Block/Output Channel Pruning on Computer Vision Tasks

2403.19969

Published 4/1/2024 by Guanhua Ding, Zexi Ye, Zhen Zhong, Gang Li, David Shao

Separate, Dynamic and Differentiable (SMART) Pruner for Block/Output Channel Pruning on Computer Vision Tasks

Abstract

Deep Neural Network (DNN) pruning has emerged as a key strategy to reduce model size, improve inference latency, and lower power consumption on DNN accelerators. Among various pruning techniques, block and output channel pruning have shown significant potential in accelerating hardware performance. However, their accuracy often requires further improvement. In response to this challenge, we introduce a separate, dynamic and differentiable (SMART) pruner. This pruner stands out by utilizing a separate, learnable probability mask for weight importance ranking, employing a differentiable Top k operator to achieve target sparsity, and leveraging a dynamic temperature parameter trick to escape from non-sparse local minima. In our experiments, the SMART pruner consistently demonstrated its superiority over existing pruning methods across a wide range of tasks and models on block and output channel pruning. Additionally, we extend our testing to Transformer-based models in N:M pruning scenarios, where SMART pruner also yields state-of-the-art results, demonstrating its adaptability and robustness across various neural network architectures, and pruning types.

Create account to get full access

Introduction

This paper focuses on improving the deployment efficiency of deep neural networks (DNNs) on resource-constrained edge devices. One popular approach for this is neural network pruning, which removes less significant parameters from the network to conserve computational resources. The paper discusses various hardware-supported pruning techniques, including N:M pruning, block pruning, and output channel pruning.

The paper then reviews the limitations of existing pruning methods, such as the risk of pruning important weights in criteria-based approaches, the inability of weight-frozen learnable mask methods to adapt to weight changes, and the "STE learning delay problem" in straight-through estimator (STE)-based approaches.

To address these challenges, the authors propose a novel pruning method called SMART pruning. SMART uses a separate learnable soft probability mask to rank the importance of weights across different layers, which helps capture cross-layer dependencies more effectively. It also employs a differentiable Top-k operator and a dynamic temperature parameter trick to facilitate the mask's convergence to a binary mask without the risk of getting stuck in a non-sparse local minimum.

Compared to prior techniques, SMART offers several advantages: it integrates weight importance ranking into the training process, dynamically adapts the mask to weight changes, eliminates the need for complicated auxiliary model training, and provides convergence guarantees. The paper also includes theoretical analysis and experimental results demonstrating the state-of-the-art performance of SMART pruning across various models and tasks.

Methodology

The text introduces the methodology of the SMART pruner. Section 2.1 presents the differentiable Top k operator, which is a key component of the SMART pruner. Section 2.2 explains the SMART pruning algorithm, including the problem formulation and training flow. Finally, Section 2.3 discusses the dynamic temperature parameter trick and its role in avoiding non-sparse local minima.

$Figure 1: Illustration of the SMART pruner. In the forward pass, the original weights (top left) are element-wise multiplied by a rescaled importance mask, fτ⁢(m)subscript𝑓𝜏𝑚f_{\tau}(m)italic_f start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ( italic_m ), generated from a differentiable top-k𝑘kitalic_k operator, to obtain the masked weights matrix (top right). In the backward pass, the original weights parameter w𝑤witalic_w and mask parameter m𝑚mitalic_m are updated via backpropagation.$

Figure 1: Illustration of the SMART pruner. In the forward pass, the original weights (top left) are element-wise multiplied by a rescaled importance mask, fτ⁢(m)subscript𝑓𝜏𝑚f_{\tau}(m)italic_f start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ( italic_m ), generated from a differentiable top-k𝑘kitalic_k operator, to obtain the masked weights matrix (top right). In the backward pass, the original weights parameter w𝑤witalic_w and mask parameter m𝑚mitalic_m are updated via backpropagation.

The paper discusses a differentiable Top k operator and a SMART pruning algorithm that leverages this operator. The key points are:

The standard Top k operator is non-differentiable, which hinders its use in gradient-based learning frameworks. To address this, the paper introduces a differentiable Top k operator based on the sigmoid function. This operator can approximate the standard Top k operator with arbitrary precision by adjusting a temperature parameter.
The paper formulates the pruning problem as an optimization task and proposes the SMART pruning algorithm to solve it. SMART uses the differentiable Top k operator to identify the optimal pruning structure. It consists of three stages: pretraining to develop an initial model, structural searching to determine the pruning structure, and fine-tuning to enhance performance.
To prevent the algorithm from converging to non-sparse local minima, the paper introduces a dynamic temperature parameter trick. This involves continuously reducing the temperature parameter during the structural searching phase, which induces fluctuations that help the algorithm escape non-sparse solutions.
Theorems and propositions are provided to analyze the properties of the differentiable Top k operator and the effectiveness of the dynamic temperature parameter trick in facilitating sparse solutions.

The SMART pruning algorithm aims to directly solve the fundamental pruning problem, mitigating the impact of regularization bias and outperforming existing pruning methods.

Experimental Results

The paper compares a SMART pruning algorithm with state-of-the-art block, output channel, and N:M pruning schemes across various computer vision tasks and models. The authors made minor modifications to some models to optimize hardware compatibility on a top-tier Chinese autonomous driving chip. All models were trained from scratch, and detailed hyperparameters are available in the appendix.

The study benchmarked the SMART pruning algorithm against recent advancements, including PDP, ACDC, PaS, and AWG, on classification, object detection, and image segmentation tasks. The authors used Resnet50, Yolov5m, and a variant of BiSeNetv2 on ImageNet, COCO, and Cityscapes datasets, respectively. The block shape was defined as 16x8x1x1 to ensure compatibility with the target hardware.

The primary goal was to balance model accuracy and latency, using sparsity as a proxy for latency. The experimental results, summarized in Table 2, show that the SMART pruning algorithm outperforms the benchmark methods across all the evaluated models and sparsity levels (30%, 50%, and 70%). The distinct layer-wise sparsity patterns exhibited by the SMART pruner, as depicted in Figure 4, suggest that its training flow influences the cross-layer sparsity distribution, contributing to its superior performance.

Figure 4: Cross-layer sparsity distribution in block pruning. The x-axis represents the layer, the y-axis indicates sparsity, and different colors correspond to different models. The method names are shown at the top.

The provided text discusses the benefits and drawbacks of output channel pruning, which involves eliminating entire output channels in neural networks. Output channel pruning is advantageous as it reduces data I/O and can provide hardware-agnostic acceleration, meaning it can improve performance on any hardware without requiring specialized designs. However, it typically results in worse accuracy compared to other structural pruning methods at equivalent sparsity levels due to its larger granularity.

The text then compares five different pruning methods - SMART, PDP, PaS, AWG, and ACDC - across three models: Yolov5m, Resnet50, and BiSeNetv2. This comparison was conducted at output channel sparsity levels of 20% and 50%. The results demonstrate that the SMART pruning algorithm is significantly more effective than the other methods in the context of output channel pruning. The superior performance of the SMART pruner is attributed to its ability to achieve better cross-layer sparsity, which becomes more critical with the larger granularity of output channel pruning.

Figure 5: Cross-layer sparsity distribution in output channel pruning. The graph definition is the same as Figure 4.

The provided text summarizes the application of various pruning algorithms, including the SMART pruning method, to Transformer-based models such as the Swin Transformer and Vision Transformer (ViT). The experiments used a 2:4 pruning ratio, and the performance metric was Top 1 accuracy. The results show that the SMART pruning algorithm performs well on Transformer-based models in the context of N:M pruning, demonstrating the algorithm's versatility and robustness beyond convolutional neural network (CNN) architectures.

Conclusion

The paper proposes a new SMART pruner that improves block and output channel pruning. The SMART pruner ranks the importance of weights based on a learnable probability mask to better capture cross-layer correlations. It also uses a dynamic temperature parameter trick to overcome non-sparse local minima, further enhancing the pruner's effectiveness. Empirical studies show the SMART pruner outperforms existing methods across computer vision tasks and models, particularly for block and output channel pruning. Additionally, the SMART pruner demonstrates superiority in N:M pruning for Transformer-based models, showcasing its adaptability and robustness.

Appendix A Proof of Theorem 2.1

The provided text proves Theorem 2.1 by contradiction. It defines a sequence of values z_i and shows that in the limit as τ goes to 0, the sum of the sigmoid functions of the z_i values must be equal to k, a constant. However, this leads to a contradiction, so the theorem must be true.

The key steps are:

Show that the difference between adjacent z_i values goes to infinity as τ goes to 0.
Show that the sum of the sigmoid functions of the z_i values equals k.
Show that the sigmoid function of z_(N-k) goes to 0 and the sigmoid function of z_(N-k+1) goes to 1 as τ goes to 0.
This leads to a contradiction, since the sum of the sigmoid functions cannot equal k, so the theorem must be true.

There is no text provided for a summary after this section.

Appendix B Proof of Proposition 2.2

The provided text outlines the calculation of the gradient of the function f_i(x). The key steps are:

The gradient of f_i(x) with respect to x_j is derived as: d/dx_j f_i(x) = (1/τ)σ'((x_i/τ) + t(x_1, ..., x_N))*(I_[j=i] - σ'((x_j/τ) + t(x_1, ..., x_N))/Σ_i σ'((x_i/τ) + t(x_1, ..., x_N)))
The term d/dx_j t(x_1, ..., x_N) is obtained as: d/dx_j t(x_1, ..., x_N) = -σ'((x_j/τ) + t(x_1, ..., x_N))/Σ_i σ'((x_i/τ) + t(x_1, ..., x_N))
By defining the vector v = [σ'((x_1/τ) + t(x_1, ..., x_N)), ..., σ'((x_N/τ) + t(x_1, ..., x_N))]^T, the Jacobian matrix J_Top_k(x) is given as: J_Top_k(x) = (1/τ)
(diag(v) - v
v^T)

The provided equations and derivations allow for the calculation of the gradient of the function f_i(x) with respect to the input variables.

Appendix C Proof of Theorem 2.3

The provided text presents a proof by contradiction. The main idea is to assume the existence of a vector w** that has a lower loss function value than the global optimum solution (w*, m*), which contradicts the premise that (w*, m*) is the global optimum. The proof involves defining w** and m** in terms of the assumed vector w^, and then showing that L(w ⊙ Top_k(m**)) < L(w* ⊙ Top_k(m*)), which is a contradiction.

Appendix D Proof of Theorem 2.4

The proof shows that the differentiable Top-k function fᵧ(m) can approximate the standard Top-k function Top₆(m) with arbitrary precision. Specifically, there exists a τ such that for any τ in the range (0, τ̃], the difference between the two functions is less than Δ₁/Κ₁, where Δ₁ is the difference in the loss function L between the two Top-k approximations, and Κ₁ is the Lipschitz constant of L.

This allows the inequality L(w̃ ⊙ fᵧ(m̃)) - L(w* ⊙ fᵧ(m*)) ≥ Δ₁ - Δ₁ = 0 to hold. The proof concludes with this result.

Appendix E Illustration of STE Learning Delay Issue

The provided text summarizes the loss function and gradients used in typical straight-through estimator (STE)-based methods. The loss function is mathematically formulated as the minimization of a primary loss term L(w⊙I(m)) plus a regularization term λR(m), where w and m represent the weight and mask parameters, respectively, I(m) is the rounding operator that rounds the mask values to 0 or 1, and λ is a tuning parameter.

The gradients of this loss function with respect to the weights (w) and masks (m) are also derived. The gradient with respect to the weights is the gradient of the loss with respect to the effective weights (w⊙I(m)), while the gradient with respect to the masks involves the gradient of the loss with respect to the effective weights, as well as the gradient of the regularization term.

The text also notes that when the mask value is 0, the weight gradient is consistently zero, while when the mask value is 1, the weight gradient becomes equivalent to the gradient of the loss with respect to the effective weights. This behavior can lead to accuracy drops and stability issues in certain applications.

Figure 6: Illustration of STE learning delay issue.

Appendix F Extension to N:M pruning

The paper discusses how the original SMART pruning algorithm, designed for block and output channel pruning, can be modified to work with N:M pruning. The key differences between N:M pruning and block/output channel pruning are:

N:M pruning does not involve issues related to cross-layer sparsity distribution.
Since N:M pruning is a fine-grained technique, using an importance mask would lead to a mask size comparable to the weight size.

To adapt SMART pruning for N:M pruning, the original problem formulation in Equation (5) was revised, as shown in Equation (14). The main change is the removal of the importance mask parameters.

The modified algorithm, as illustrated in Algorithm 2, closely aligns with the PDP algorithm, differing in two aspects:

SMART pruning uses a soft top-k operator to generate the probability mask, while PDP uses SoftMax.
SMART pruning adopts a dynamic temperature parameter trick, while PDP uses fixed temperature parameters.

Experimental studies show that SMART pruning still outperforms PDP in N:M pruning tasks, although the margin is relatively small.

Appendix G The Accumulated Weight and Gradient (AWG) Pruning Algorithm

The Accumulated Weight and Gradient pruner is an advanced variant of the magnitude-based pruning method. It uses a combination of pre-trained weights and accumulated gradients to determine the importance of each layer in the model. The pruning process consists of three stages:

The training data is fed to the pre-trained model, and the importance scores are calculated and tracked using an exponential moving average. The least important blocks of weights are then pruned.
The model is fine-tuned on the training set for a number of epochs, with the pruning masks kept unchanged.
Finally, the model is fine-tuned for a few more epochs.

The algorithm uses several parameters, such as the number of iterative pruning steps, the fine-tuning epochs, and the decay factor for the exponential moving average. The importance score for each layer is a product of the pre-trained weight, the accumulated gradient, and a scaling factor that favors layers with high sparsity.

Appendix H Hyperparameter Settings

This section summarizes the hyperparameter settings used for different network architectures and pruning methods. The key details include:

Network architectures: ResNet50, ACDC, PAS, PDP, and Ours (SMART) for object detection, and BiSeNet v2, ViT, and Swin-Transformer for other tasks.

Batch size: Ranges from 16 to 256, depending on the network and pruning method.

Epochs: 100-200 epochs, except for BiSeNet v2 which uses 134.6 epochs.

Warm-up epochs: Varies from 0 to 20, depending on the network and pruning method.

Main optimizer and scheduler: AdamW or SGD with cosine or lambdaLR scheduling.

Mask optimizer and scheduler: Same as the main optimizer/scheduler in most cases, except for PAS which uses a separate SGD mask optimizer.

Pruning-specific parameters: Differ for each pruning method, such as maximum sparsity per layer, number of compression/decompression steps, number of search epochs, and temperature scheduling.

The provided table summarizes these hyperparameter settings in a concise manner.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Multi-Dimensional Pruning: Joint Channel, Layer and Block Pruning with Latency Constraint

Xinglong Sun, Barath Lakshmanan, Maying Shen, Shiyi Lan, Jingde Chen, Jose Alvarez

As we push the boundaries of performance in various vision tasks, the models grow in size correspondingly. To keep up with this growth, we need very aggressive pruning techniques for efficient inference and deployment on edge devices. Existing pruning approaches are limited to channel pruning and struggle with aggressive parameter reductions. In this paper, we propose a novel multi-dimensional pruning framework that jointly optimizes pruning across channels, layers, and blocks while adhering to latency constraints. We develop a latency modeling technique that accurately captures model-wide latency variations during pruning, which is crucial for achieving an optimal latency-accuracy trade-offs at high pruning ratio. We reformulate pruning as a Mixed-Integer Nonlinear Program (MINLP) to efficiently determine the optimal pruned structure with only a single pass. Our extensive results demonstrate substantial improvements over previous methods, particularly at large pruning ratios. In classification, our method significantly outperforms prior art HALP with a Top-1 accuracy of 70.0(v.s. 68.6) and an FPS of 5262 im/s(v.s. 4101 im/s). In 3D object detection, we establish a new state-of-the-art by pruning StreamPETR at a 45% pruning ratio, achieving higher FPS (37.3 vs. 31.7) and mAP (0.451 vs. 0.449) than the dense baseline.

6/19/2024

cs.CV cs.AI cs.LG

Automatic Channel Pruning for Multi-Head Attention

Eunho Lee, Youngbae Hwang

Despite the strong performance of Transformers, their quadratic computation complexity presents challenges in applying them to vision tasks. Automatic pruning is one of effective methods for reducing computation complexity without heuristic approaches. However, directly applying it to multi-head attention is not straightforward due to channel misalignment. In this paper, we propose an automatic channel pruning method to take into account the multi-head attention mechanism. First, we incorporate channel similarity-based weights into the pruning indicator to preserve more informative channels in each head. Then, we adjust pruning indicator to enforce removal of channels in equal proportions across all heads, preventing the channel misalignment. We also add a reweight module to compensate for information loss resulting from channel removal, and an effective initialization step for pruning indicator based on difference of attention between original structure and each channel. Our proposed method can be used to not only original attention, but also linear attention, which is more efficient as linear complexity with respect to the number of tokens. On ImageNet-1K, applying our pruning method to the FLattenTransformer, which includes both attention mechanisms, shows outperformed accuracy for several MACs compared with previous state-of-the-art efficient models and pruned methods. Code will be available soon.

6/3/2024

cs.CV cs.AI cs.CC

Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Yang He, Joey Tianyi Zhou

Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the local attention weights are compared at a global level, which may cause some locally important weights to be pruned due to their relatively small magnitude globally. The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at various hierarchical levels. To solve the aforementioned issues, we have developed a Data-independent Module-Aware Pruning method (DIMAP) to compress hierarchical ViTs. To ensure that local attention weights at different hierarchical levels are compared fairly in terms of their contribution, we treat them as a module and examine their contribution by analyzing their information distortion. Furthermore, we introduce a novel weight metric that is solely based on weights and does not require input images, thereby eliminating the dependence on the patch merging process. Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of Swin-B. When we reduce 33.2% FLOPs and 33.2% parameters of Swin-S, we can even achieve a 0.8% higher relative top-5 accuracy than the original model. Code is available at: https://github.com/he-y/Data-independent-Module-Aware-Pruning

4/23/2024

cs.CV cs.LG

New!Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks

Beatrice Alessandra Motetti, Matteo Risso, Alessio Burrello, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari

The resource requirements of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices. Common approaches to address this issue are pruning and mixed-precision quantization, which lead to latency and memory occupation improvements. These optimization techniques are usually applied independently. We propose a novel methodology to apply them jointly via a lightweight gradient-based search, and in a hardware-aware manner, greatly reducing the time required to generate Pareto-optimal DNNs in terms of accuracy versus cost (i.e., latency or memory). We test our approach on three edge-relevant benchmarks, namely CIFAR-10, Google Speech Commands, and Tiny ImageNet. When targeting the optimization of the memory footprint, we are able to achieve a size reduction of 47.50% and 69.54% at iso-accuracy with the baseline networks with all weights quantized at 8 and 2-bit, respectively. Our method surpasses a previous state-of-the-art approach with up to 56.17% size reduction at iso-accuracy. With respect to the sequential application of state-of-the-art pruning and mixed-precision optimizations, we obtain comparable or superior results, but with a significantly lowered training time. In addition, we show how well-tailored cost models can improve the cost versus accuracy trade-offs when targeting specific hardware for deployment.

7/2/2024

cs.LG