A Multi-Level Superoptimizer for Tensor Programs

Read original: arXiv:2405.05751 - Published 5/10/2024 by Mengdi Wu, Xinhao Cheng, Oded Padon, Zhihao Jia

Overview

Mirage is a multi-level superoptimizer for tensor programs, the computations over multi-dimensional arrays that underlie deep neural networks.
Mirage uses a novel representation called μGraphs to enable optimizations across different levels of the GPU compute hierarchy.
Mirage incorporates a pruning technique based on abstraction to navigate the large search space and provide optimality guarantees.
Mirage also introduces a probabilistic equivalence verification procedure to ensure the optimized program is equivalent to the input.
Mirage outperforms existing approaches by up to 3.5 times, even for widely used and heavily optimized deep neural networks.

Plain English Explanation

Mirage is a new tool that helps make tensor programs, which are used in machine learning, run much faster. Tensor programs are a type of computer code that performs complex mathematical calculations. Mirage works by finding ways to optimize this code to make it run more efficiently.

One of the key ideas in Mirage is a new way of representing the tensor program called "μGraphs." This allows Mirage to discover new optimizations that combine different types of transformations, like changing the mathematical operations or the way the program is scheduled to run.

To explore the possible ways to optimize the program, Mirage uses a technique called "pruning" that helps it focus on the most promising options. This significantly reduces the amount of searching Mirage has to do, while still providing a certain guarantee about the optimality of the program it finds.

Mirage also has a way to verify that the optimized program is still doing the same thing as the original program, using a "probabilistic equivalence verification procedure." This ensures the optimized program will give the same results as the original.

When tested, Mirage was able to make tensor programs run up to 3.5 times faster than existing optimization methods, even for programs that had already been heavily optimized. This shows Mirage is a powerful tool for improving the performance of machine learning and other applications that use tensor programs.

Technical Explanation

Mirage introduces a novel multi-level superoptimizer for tensor programs, which are a core computational model used in machine learning and other fields. The key idea in Mirage is the use of "μGraphs," a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy.
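To make the idea of one representation spanning three levels concrete, here is a minimal data-structure sketch. All names (`Op`, `Graph`, `check_levels`) are hypothetical and chosen for illustration; they are not Mirage's API, and the real μGraph carries much more information (tensor shapes, layouts, grid dimensions, and so on). The sketch only assumes that an operator at one level may be defined by a graph at the next finer level.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical names for illustration only; this is not Mirage's API.
LEVELS = ("kernel", "block", "thread")  # coarse to fine

@dataclass
class Op:
    name: str
    inputs: List[str]
    output: str
    body: Optional["Graph"] = None  # finer-level graph defining this op, if any

@dataclass
class Graph:
    level: str
    ops: List[Op] = field(default_factory=list)

    def depth(self) -> int:
        """Number of hierarchy levels reachable from this graph."""
        return 1 + max((op.body.depth() for op in self.ops if op.body), default=0)

def check_levels(graph: Graph) -> bool:
    """A nested graph must sit exactly one level below its parent."""
    idx = LEVELS.index(graph.level)
    for op in graph.ops:
        if op.body is not None:
            if idx + 1 >= len(LEVELS) or op.body.level != LEVELS[idx + 1]:
                return False
            if not check_levels(op.body):
                return False
    return True

# A custom kernel defined by a block graph, which contains a thread graph.
thread_graph = Graph("thread", [Op("exp", ["t"], "e")])
block_graph = Graph("block", [
    Op("matmul", ["q_tile", "k_tile"], "t"),
    Op("fused_elementwise", ["t"], "e", body=thread_graph),
])
kernel_graph = Graph("kernel", [Op("custom_kernel", ["Q", "K"], "O", body=block_graph)])
```

Because every level uses the same `Graph` type, a search procedure could in principle apply the same kinds of rewrites at the kernel, thread block, and thread levels, which is the property the uniform representation is meant to provide.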

This μGraph representation enables Mirage to discover novel optimizations that combine algebraic transformations, schedule transformations, and generation of new custom kernels. To efficiently navigate the large search space of possible optimizations, Mirage introduces a pruning technique based on abstraction that significantly reduces the search space while providing a certain optimality guarantee.
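The pattern of pruning a search with an abstraction can be illustrated on a toy problem. The sketch below is not Mirage's algorithm: it enumerates small arithmetic expressions over three inputs, and the abstraction (an assumption of this example) is the highest power of each variable. Since addition and multiplication with positive coefficients never lower a degree, any partial expression that already overshoots the target's degrees can be discarded without losing the smallest solution.

```python
from itertools import product

VARS = ("x", "y", "z")

def var(i):
    """Polynomial for one input variable: {exponent tuple: coefficient}."""
    return {tuple(1 if j == i else 0 for j in range(len(VARS))): 1}

def add(p, q):
    out = dict(p)
    for m, c in q.items():
        out[m] = out.get(m, 0) + c
    return out

def mul(p, q):
    out = {}
    for (m1, c1), (m2, c2) in product(p.items(), q.items()):
        m = tuple(a + b for a, b in zip(m1, m2))
        out[m] = out.get(m, 0) + c1 * c2
    return out

def degrees(p):
    """Abstraction: highest power of each variable appearing in p."""
    return tuple(max(m[i] for m in p) for i in range(len(VARS)))

def search(target, max_ops=3, prune=True):
    """Enumerate expressions by operator count; return (expr, ops, explored)."""
    bound = degrees(target)
    by_size = {0: [(VARS[i], var(i)) for i in range(len(VARS))]}
    explored = len(by_size[0])
    for ops in range(1, max_ops + 1):
        by_size[ops] = []
        for left_ops in range(ops):
            pairs = product(by_size[left_ops], by_size[ops - 1 - left_ops])
            for (ls, lp), (rs, rp) in pairs:
                for sym, fn in (("+", add), ("*", mul)):
                    poly = fn(lp, rp)
                    # Prune: a degree above the target's can never come back down.
                    if prune and any(d > b for d, b in zip(degrees(poly), bound)):
                        continue
                    explored += 1
                    expr = f"({ls} {sym} {rs})"
                    if poly == target:
                        return expr, ops, explored
                    by_size[ops].append((expr, poly))
    return None, None, explored

# Target: x*y + x*z, written with three operators.
target = add(mul(var(0), var(1)), mul(var(0), var(2)))
expr, ops, explored_pruned = search(target, prune=True)
_, ops_full, explored_full = search(target, prune=False)
```

Both runs find a two-operator expression equivalent to the three-operator target, and the pruned run visits fewer candidates. The abstraction here is deliberately crude; the point is only that a sound abstraction lets the search skip candidates without giving up the best result.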

To ensure that the final optimized μGraph is equivalent to the input program, Mirage incorporates a probabilistic equivalence verification procedure with strong theoretical guarantees.
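The general idea behind randomized equivalence checking can be sketched in a few lines. This is an illustration of the technique in its simplest form, not Mirage's procedure: the modulus, the matrix size, the number of trials, and the three example programs are all assumptions of this sketch. Two programs are run on random inputs with arithmetic done modulo a prime; any mismatch proves they differ, while repeated agreement makes equivalence increasingly likely.

```python
import random

P = 2_147_483_647  # prime modulus (2**31 - 1); an assumption of this sketch

def matmul(a, b):
    """Multiply two matrices (lists of rows), reducing entries mod P."""
    return [[sum(x * y for x, y in zip(row, col)) % P for col in zip(*b)]
            for row in a]

def matadd(a, b):
    """Add two matrices entrywise mod P."""
    return [[(x + y) % P for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def random_matrix(rows, cols, rng):
    return [[rng.randrange(P) for _ in range(cols)] for _ in range(rows)]

def original(a, b, c):
    return matadd(matmul(a, b), matmul(a, c))  # A@B + A@C: two matmuls

def optimized(a, b, c):
    return matmul(a, matadd(b, c))             # A@(B + C): one matmul

def buggy(a, b, c):
    return matmul(a, b)                        # drops the A@C term

def probably_equivalent(f, g, trials=8, n=4, seed=0):
    """Return False on the first mismatch, True if all random trials agree."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c = (random_matrix(n, n, rng) for _ in range(3))
        if f(a, b, c) != g(a, b, c):
            return False
    return True
```

Calling `probably_equivalent(original, optimized)` accepts the algebraic rewrite, while `probably_equivalent(original, buggy)` rejects the faulty one on a mismatching input. Working modulo a prime keeps the arithmetic exact, which avoids the floating-point rounding differences that would otherwise make two correct programs disagree slightly.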

The evaluation of Mirage shows that it outperforms existing approaches by up to 3.5 times, even for widely used and heavily optimized deep neural networks. This demonstrates the power of Mirage's multi-level optimization approach and its ability to discover significant performance improvements.

Critical Analysis

The paper provides a thorough technical explanation of the Mirage system and its novel contributions. However, it does not extensively discuss potential limitations or areas for further research.

One potential limitation is the reliance on the μGraph representation, which may not capture all optimization opportunities across the different levels of the GPU compute hierarchy. Additionally, the pruning technique, while effective, could miss some optimizations if the abstraction is not sufficiently precise.

The paper also does not address how Mirage would handle tensor programs with more complex control flow or data dependencies, which could pose additional challenges for the optimization and verification procedures.

Further research could explore ways to extend Mirage to handle a wider range of tensor program structures, investigate alternative optimization and verification techniques, and assess the system's scalability and robustness on larger, more diverse benchmarks.

Conclusion

Mirage is a multi-level superoptimizer that advances the state of the art in optimizing tensor programs, a critical component of modern machine learning and scientific computing. By introducing the μGraph representation and using it to discover new optimizations, Mirage delivers performance improvements of up to 3.5 times compared to existing approaches.

The strong theoretical foundations and rigorous verification procedures in Mirage ensure that the optimized programs remain functionally equivalent to the original, providing confidence in the safety and reliability of the optimized computations. As machine learning and scientific computing continue to grow in importance, tools like Mirage will become increasingly valuable in driving performance and efficiency improvements across a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Multi-Level Superoptimizer for Tensor Programs

Mengdi Wu, Xinhao Cheng, Oded Padon, Zhihao Jia

We introduce Mirage, the first multi-level superoptimizer for tensor programs. A key idea in Mirage is μGraphs, a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy. μGraphs enable Mirage to discover novel optimizations that combine algebraic transformations, schedule transformations, and generation of new custom kernels. To navigate the large search space, Mirage introduces a pruning technique based on abstraction that significantly reduces the search space and provides a certain optimality guarantee. To ensure that the optimized μGraph is equivalent to the input program, Mirage introduces a probabilistic equivalence verification procedure with strong theoretical guarantees. Our evaluation shows that Mirage outperforms existing approaches by up to 3.5× even for DNNs that are widely used and heavily optimized. Mirage is publicly available at https://github.com/mirage-project/mirage.

5/10/2024

Mirage: An RNS-Based Photonic Accelerator for DNN Training

Cansu Demirkiran, Guowei Yang, Darius Bunandar, Ajay Joshi

Photonic computing is a compelling avenue for performing highly efficient matrix multiplication, a crucial operation in Deep Neural Networks (DNNs). While this method has shown great success in DNN inference, meeting the high precision demands of DNN training proves challenging due to the precision limitations imposed by costly data converters and the analog noise inherent in photonic hardware. This paper proposes Mirage, a photonic DNN training accelerator that overcomes the precision challenges in photonic hardware using the Residue Number System (RNS). RNS is a numeral system based on modular arithmetic, allowing us to perform high-precision operations via multiple low-precision modular operations. In this work, we present a novel micro-architecture and dataflow for an RNS-based photonic tensor core performing modular arithmetic in the analog domain. By combining RNS and photonics, Mirage provides high energy efficiency without compromising precision and can successfully train state-of-the-art DNNs achieving accuracy comparable to FP32 training. Our study shows that on average across several DNNs when compared to systolic arrays, Mirage achieves more than 23.8× faster training and 32.1× lower EDP in an iso-energy scenario and consumes 42.8× lower power with comparable or better EDP in an iso-area scenario.

5/27/2024

Towards a high-performance AI compiler with upstream MLIR

Renato Golin, Lorenzo Chelini, Adam Siemieniuk, Kavitha Madhu, Niranjan Hasabnis, Hans Pabst, Evangelos Georganas, Alexander Heinecke

This work proposes a compilation flow using open-source compiler passes to build a framework to achieve ninja performance from a generic linear algebra high-level abstraction. We demonstrate this flow with a proof-of-concept MLIR project that uses input IR in Linalg-on-Tensor from TensorFlow and PyTorch, performs cache-level optimizations and lowering to micro-kernels for efficient vectorization, achieving over 90% of the performance of ninja-written equivalent programs. The contributions of this work include: (1) Packing primitives on the tensor dialect and passes for cache-aware distribution of tensors (single and multi-core) and type-aware instructions (VNNI, BFDOT, BFMMLA), including propagation of shapes across the entire function; (2) A linear algebra pipeline, including tile, fuse and bufferization strategies to get model-level IR into hardware friendly tile calls; (3) A mechanism for micro-kernel lowering to an open source library that supports various CPUs.

4/24/2024

Pruner: A Speculative Exploration Mechanism to Accelerate Tensor Program Tuning

Liang Qiao, Jun Shi, Xiaoyu Hao, Xi Fang, Minfan Zhao, Ziqi Zhu, Junshi Chen, Hong An, Bing Li, Honghui Yuan, Xinyang Wang, Xulong Tang

Tensor program tuning is essential for the efficient deployment of deep neural networks. Search-based approaches have demonstrated scalability and effectiveness in automatically finding high-performance programs for specific hardware. However, the search process is often inefficient, taking hours or even days to discover optimal programs due to the exploration mechanisms guided by an accurate but slow learned cost model. Meanwhile, the learned cost model trained on one platform cannot seamlessly adapt online to another, which we call cross-platform online unawareness. In this work, we propose Pruner and MoA-Pruner. Pruner is a speculative exploration mechanism that accelerates the search process using a Draft-then-Verify paradigm. Instead of applying the complex learned cost model to all explored candidates, Pruner drafts small-scale speculative candidates by introducing a naive symbol analyzer (draft model), then identifies the best candidates by the learned cost model. MoA-Pruner introduces Momentum online Adaptation to address the cross-platform online unawareness. We incorporate these techniques into the Ansor and conduct extensive experiments on three GPU-based platforms. Results show that in online cost model tuning scenarios, Pruner and MoA-Pruner can achieve an average speedup of 2.6× and 4.82× compared to Ansor. In offline tuning scenarios, Pruner can achieve an average speedup of 4.75× and 4.05× compared to TenSet and TLP, respectively. The code is available at https://github.com/qiaolian9/Pruner.

7/2/2024