Pruner: A Speculative Exploration Mechanism to Accelerate Tensor Program Tuning

Read original: arXiv:2402.02361 - Published 7/2/2024 by Liang Qiao, Jun Shi, Xiaoyu Hao, Xi Fang, Minfan Zhao, Ziqi Zhu, Junshi Chen, Hong An, Bing Li, Honghui Yuan and 2 others


Overview

Tensor program tuning is crucial for efficiently deploying deep neural networks.
Search-based approaches can automatically find high-performance programs for specific hardware, but the search process is often inefficient.
This work proposes Pruner, which accelerates the search process, and MoA-Pruner, which addresses cross-platform online unawareness: a cost model trained on one platform cannot seamlessly adapt online to another.

Plain English Explanation

To run deep learning models efficiently on different hardware devices, the code needs to be optimized. Pruner and MoA-Pruner are techniques that can speed up this optimization process.

Normally, the optimization process relies on an accurate but slow model to score every candidate program. Pruner first uses a faster, simpler model to narrow the candidates down to a small set, and the slow, complex model is then used only on that set to identify the best ones.

MoA-Pruner also addresses a problem where the slow, complex model trained on one hardware platform doesn't work as well on another platform. It can adapt the model to work better on different platforms.

In the paper's experiments, tuning is on average 2.6 times faster with Pruner and 4.82 times faster with MoA-Pruner than with the Ansor baseline, which means deep learning models can be deployed more quickly on different hardware.

Technical Explanation

The paper proposes two techniques to improve tensor program tuning:

Pruner is a speculative exploration mechanism built on a Draft-then-Verify paradigm. A fast, simple "draft model" (a naive symbol analyzer) drafts a small set of speculative candidates, and the slow, complex learned cost model is then applied only to that set to identify the best candidates.

MoA-Pruner introduces "Momentum online Adaptation" to address cross-platform online unawareness, the issue that a learned cost model trained on one hardware platform cannot seamlessly adapt online to another.
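The Draft-then-Verify flow that Pruner uses can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `draft_then_verify`, the two toy scoring functions, and the integer "tile size" candidates are all hypothetical stand-ins chosen for illustration. The only idea taken from the paper is the two-stage structure, where a cheap scorer shortlists candidates and the expensive cost model is called only on the shortlist.

```python
from typing import Callable, List, Sequence, TypeVar

C = TypeVar("C")


def draft_then_verify(
    candidates: Sequence[C],
    draft_score: Callable[[C], float],   # cheap scorer (hypothetical draft model)
    verify_score: Callable[[C], float],  # expensive scorer (hypothetical cost model)
    draft_size: int,
    top_k: int,
) -> List[C]:
    """Shortlist with the cheap scorer, then rank the shortlist with the slow one."""
    # Draft: score every candidate cheaply and keep the draft_size best.
    drafted = sorted(candidates, key=draft_score, reverse=True)[:draft_size]
    # Verify: the expensive scorer only ever sees the drafted shortlist.
    return sorted(drafted, key=verify_score, reverse=True)[:top_k]


# Toy setup (assumption): candidates are tile sizes, higher score is better.
calls = {"verify": 0}


def cheap_draft_score(tile: int) -> float:
    # Rough heuristic that is close to, but not the same as, the slow scorer.
    return -abs(tile - 32)


def slow_cost_model(tile: int) -> float:
    # Counts its own invocations so the saving is visible.
    calls["verify"] += 1
    return -abs(tile - 24)


candidates = list(range(1, 129))
best = draft_then_verify(
    candidates, cheap_draft_score, slow_cost_model, draft_size=16, top_k=4
)
print(best, calls["verify"])
```

In this toy run the slow scorer is invoked 16 times instead of 128. The trade-off is that a good candidate is lost if the draft stage ranks it outside the shortlist, so the quality of the draft scorer and the choice of `draft_size` both matter.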

The authors incorporate these techniques into the Ansor tensor program optimizer and evaluate them on three GPU-based platforms. In online cost model tuning scenarios, Pruner and MoA-Pruner achieve average speedups of 2.6x and 4.82x, respectively, compared to the baseline Ansor approach. In offline tuning scenarios, Pruner achieves average speedups of 4.75x and 4.05x compared to TenSet and TLP, respectively.

Critical Analysis

The paper addresses an important challenge in deploying deep learning models efficiently on different hardware. The proposed techniques, Pruner and MoA-Pruner, show promising results in accelerating the optimization process.

However, the paper does not provide much detail on the limitations of the approaches. It would be helpful to understand the types of workloads or hardware configurations where the techniques may not work as well, or if there are any potential downsides to the "Draft-then-Verify" strategy employed by Pruner.

Additionally, the paper could have explored the generalizability of the techniques beyond the specific Ansor optimizer used in the experiments. It would be interesting to see how Pruner and MoA-Pruner could be applied or adapted to other search-based tensor program tuning approaches.

Conclusion

The paper presents Pruner and MoA-Pruner, two techniques that can significantly accelerate the optimization of tensor programs for deep neural networks. By using a faster "draft model" and addressing cross-platform adaptation issues, these approaches can speed up the deployment of efficient deep learning models on a variety of hardware platforms. While the paper could have explored some limitations and broader applicability, the proposed methods demonstrate a promising direction for improving the efficiency of deep learning deployments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Pruner: A Speculative Exploration Mechanism to Accelerate Tensor Program Tuning

Liang Qiao, Jun Shi, Xiaoyu Hao, Xi Fang, Minfan Zhao, Ziqi Zhu, Junshi Chen, Hong An, Bing Li, Honghui Yuan, Xinyang Wang, Xulong Tang

Tensor program tuning is essential for the efficient deployment of deep neural networks. Search-based approaches have demonstrated scalability and effectiveness in automatically finding high-performance programs for specific hardware. However, the search process is often inefficient, taking hours or even days to discover optimal programs due to the exploration mechanisms guided by an accurate but slow learned cost model. Meanwhile, the learned cost model trained on one platform cannot seamlessly adapt online to another, which we call cross-platform online unawareness. In this work, we propose Pruner and MoA-Pruner. Pruner is a speculative exploration mechanism that accelerates the search process using a Draft-then-Verify paradigm. Instead of applying the complex learned cost model to all explored candidates, Pruner drafts small-scale speculative candidates by introducing a naive symbol analyzer (draft model), then identifies the best candidates by the learned cost model. MoA-Pruner introduces Momentum online Adaptation to address the cross-platform online unawareness. We incorporate these techniques into the Ansor and conduct extensive experiments on three GPU-based platforms. Results show that in online cost model tuning scenarios, Pruner and MoA-Pruner can achieve an average speedup of $2.6\times$ and $4.82\times$ compared to Ansor. In offline tuning scenarios, Pruner can achieve an average speedup of $4.75\times$ and $4.05\times$ compared to TenSet and TLP, respectively. The code is available at https://github.com/qiaolian9/Pruner.

7/2/2024

PAT: Pruning-Aware Tuning for Large Language Models

Yijiang Liu, Huanrui Yang, Youxin Chen, Rongyu Zhang, Miao Wang, Yuan Du, Li Du

Large language models (LLMs) excel in language tasks, especially with supervised fine-tuning after pre-training. However, their substantial memory and computational requirements hinder practical applications. Structural pruning, which reduces less significant weight dimensions, is one solution. Yet, traditional post-hoc pruning often leads to significant performance loss, with limited recovery from further fine-tuning due to reduced capacity. Since the model fine-tuning refines the general and chaotic knowledge in pre-trained models, we aim to incorporate structural pruning with the fine-tuning, and propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy while preserving the model performance to the maximum extent. Specifically, we insert the innovative Hybrid Sparsification Modules (HSMs) between the Attention and FFN components to accordingly sparsify the upstream and downstream linear modules. The HSM comprises a lightweight operator and a globally shared trainable mask. The lightweight operator maintains a training overhead comparable to that of LoRA, while the trainable mask unifies the channels to be sparsified, ensuring structural pruning. Additionally, we propose the Identity Loss which decouples the transformation and scaling properties of the HSMs to enhance training robustness. Extensive experiments demonstrate that PAT excels in both performance and efficiency. For example, our Llama2-7b model with a 25% pruning ratio achieves $1.33\times$ speedup while outperforming the LoRA-finetuned model by up to 1.26% in accuracy with a similar training cost. Code: https://github.com/kriskrisliu/PAT_Pruning-Aware-Tuning

8/28/2024

Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models

Peijie Dong, Lujun Li, Zhenheng Tang, Xiang Liu, Xinglin Pan, Qiang Wang, Xiaowen Chu

Despite the remarkable capabilities, Large Language Models (LLMs) face deployment challenges due to their extensive size. Pruning methods drop a subset of weights to accelerate, but many of them require retraining, which is prohibitively expensive and computationally demanding. Recently, post-training pruning approaches introduced novel metrics, enabling the pruning of LLMs without retraining. However, these metrics require the involvement of human experts and tedious trial and error. To efficiently identify superior pruning metrics, we develop an automatic framework for searching symbolic pruning metrics using genetic programming. In particular, we devise an elaborate search space encompassing the existing pruning metrics to discover the potential symbolic pruning metric. We propose an opposing operation simplification strategy to increase the diversity of the population. In this way, Pruner-Zero allows auto-generation of symbolic pruning metrics. Based on the searched results, we explore the correlation between pruning metrics and performance after pruning and summarize some principles. Extensive experiments on LLaMA and LLaMA-2 on language modeling and zero-shot tasks demonstrate that our Pruner-Zero obtains superior performance than SOTA post-training pruning methods. Code at: https://github.com/pprp/Pruner-Zero

6/6/2024

ONNXPruner: ONNX-Based General Model Pruning Adapter

Dongdong Ren, Wenbin Li, Tianyu Ding, Lei Wang, Qi Fan, Jing Huo, Hongbing Pan, Yang Gao

Recent advancements in model pruning have focused on developing new algorithms and improving upon benchmarks. However, the practical application of these algorithms across various models and platforms remains a significant challenge. To address this challenge, we propose ONNXPruner, a versatile pruning adapter designed for the ONNX format models. ONNXPruner streamlines the adaptation process across diverse deep learning frameworks and hardware platforms. A novel aspect of ONNXPruner is its use of node association trees, which automatically adapt to various model architectures. These trees clarify the structural relationships between nodes, guiding the pruning process, particularly highlighting the impact on interconnected nodes. Furthermore, we introduce a tree-level evaluation method. By leveraging node association trees, this method allows for a comprehensive analysis beyond traditional single-node evaluations, enhancing pruning performance without the need for extra operations. Experiments across multiple models and datasets confirm ONNXPruner's strong adaptability and increased efficacy. Our work aims to advance the practical application of model pruning.

4/15/2024