Optimal Kernel Orchestration for Tensor Programs with Korch

Read original: arXiv:2406.09465 - Published 6/17/2024 by Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai

Overview

• This paper introduces Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for the tensor programs underlying deep neural networks (DNNs).

• Kernel orchestration is the task of mapping the computation defined by a DNN's operators onto GPU kernels. Prior approaches handle it by greedily applying operator fusion, which misses a variety of optimization opportunities.

• Korch first applies operator fission to decompose operators into a small set of basic tensor algebra primitives. It then formalizes kernel orchestration as a constrained optimization problem and solves it with an off-the-shelf binary linear programming solver.

• In the authors' evaluation on a variety of DNNs, Korch outperforms existing tensor program optimizers by up to 1.7x on V100 GPUs and up to 1.6x on A100 GPUs.

Plain English Explanation

Machine learning models like neural networks are built from tensor operators such as convolutions and matrix multiplications. These operators are computationally intensive, so it's important to run them as efficiently as possible.

The key challenge is kernel orchestration - deciding how the work of those operators is grouped into GPU kernels, the functions that actually run on the GPU. Existing optimizers mostly rely on operator fusion, which merges several operators into a single kernel using greedy rules. Greedy decisions are made one at a time without considering the whole program, so they can miss better groupings.

Korch is a new system that tackles this problem. It first breaks each operator into smaller basic primitives (operator fission), which exposes more ways to regroup the work across operator boundaries. It then treats the choice of kernels as a constrained optimization problem and hands it to a binary linear programming solver, which returns an optimal orchestration strategy for that formulation instead of a heuristic guess.

By making tensor programs more efficient, Korch could help machine learning models run faster and use less GPU time for the same work.

Technical Explanation

At the heart of Korch is a kernel orchestration pipeline with three stages, as described in the paper's abstract:

Operator fission: Instead of directly fusing operators, Korch first decomposes each tensor operator into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations.
Constrained optimization: Korch formalizes kernel orchestration over the resulting primitives as a constrained optimization problem and uses an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy.
Executable generation: Korch generates an executable that can be directly deployed on modern GPU platforms.
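To make operator fission concrete, here is a minimal Python sketch that rewrites a row-wise softmax as a chain of basic primitives. The abstract of the paper describes fission as decomposing operators into basic tensor algebra primitives; the particular primitive names below (elementwise, reduce_rows, broadcast_rows) and the choice of softmax are assumptions made for illustration, not Korch's actual primitive set or API.

```python
import math

# Hypothetical primitives, named for illustration only.
def elementwise(fn, xs):
    """Apply fn independently to every element."""
    return [[fn(x) for x in row] for row in xs]

def reduce_rows(fn, xs):
    """Reduce each row to a single value."""
    return [fn(row) for row in xs]

def broadcast_rows(fn, xs, per_row):
    """Combine every element with the reduced value of its row."""
    return [[fn(x, r) for x in row] for row, r in zip(xs, per_row)]

def softmax_monolithic(xs):
    """Softmax written as one opaque operator."""
    out = []
    for row in xs:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def softmax_fissioned(xs):
    """The same computation expressed as five primitive steps."""
    row_max = reduce_rows(max, xs)
    shifted = broadcast_rows(lambda x, m: x - m, xs, row_max)
    exps = elementwise(math.exp, shifted)
    row_sum = reduce_rows(sum, exps)
    return broadcast_rows(lambda e, s: e / s, exps, row_sum)

if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
    print(softmax_fissioned(data))
```

Once the operator is split this way, an optimizer is free to group, say, the final division with a primitive from the next operator, which is a choice that is not visible while softmax is a single unit.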

The authors evaluate Korch on a variety of DNNs running on NVIDIA V100 and A100 GPUs. They report that Korch outperforms existing tensor program optimizers by up to 1.7x on V100 GPUs and up to 1.6x on A100 GPUs.

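One key innovation in Korch is that it replaces greedy operator fusion with a global search. Because operators are first split into primitives, the optimizer can consider kernels that span parts of several operators. Because the selection is solved as a binary linear program, the chosen strategy is optimal with respect to that formulation and is not the product of local heuristics.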
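The difference between greedy fusion and an optimal selection can be sketched with a toy binary program in Python. Everything here is an assumption made for illustration: the four primitives, the candidate kernels and their costs are invented, the "each primitive runs exactly once" constraint is a simplification, and exhaustive enumeration stands in for the off-the-shelf binary linear programming solver mentioned in the abstract. The constraints Korch actually uses are defined in the paper.

```python
from itertools import product

# Hypothetical candidates: (primitives covered, cost in arbitrary time units).
CANDIDATES = [
    ((0,), 4), ((1,), 4), ((2,), 4), ((3,), 4),
    ((0, 1), 5), ((1, 2), 3), ((2, 3), 4),
    ((0, 1, 2), 7), ((1, 2, 3), 6),
]
NUM_PRIMITIVES = 4

def is_feasible(x):
    """Assumed constraint: each primitive is run by exactly one selected kernel."""
    counts = [0] * NUM_PRIMITIVES
    for selected, (prims, _) in zip(x, CANDIDATES):
        if selected:
            for p in prims:
                counts[p] += 1
    return all(c == 1 for c in counts)

def total_cost(x):
    """Objective: sum of the costs of the selected kernels."""
    return sum(cost for selected, (_, cost) in zip(x, CANDIDATES) if selected)

def solve_exhaustively():
    """Try every binary vector x; a stand-in for a real BLP solver."""
    feasible = (x for x in product((0, 1), repeat=len(CANDIDATES)) if is_feasible(x))
    best = min(feasible, key=total_cost)
    return [CANDIDATES[i][0] for i, s in enumerate(best) if s], total_cost(best)

def greedy_fusion():
    """Baseline: always take the largest candidate starting at the next primitive."""
    chosen, cost, nxt = [], 0, 0
    while nxt < NUM_PRIMITIVES:
        options = [(p, c) for p, c in CANDIDATES if p[0] == nxt]
        prims, c = max(options, key=lambda pc: len(pc[0]))
        chosen.append(prims)
        cost += c
        nxt = prims[-1] + 1
    return chosen, cost

if __name__ == "__main__":
    print("optimal:", solve_exhaustively())
    print("greedy: ", greedy_fusion())
```

With these invented costs, the greedy baseline fuses the first three primitives and is left with an expensive single-primitive kernel, while the exhaustive search pairs the primitives two by two at lower total cost. Enumeration grows as 2^K in the number of candidates K, which is why a dedicated solver is needed for real programs.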

Critical Analysis

The abstract reports peak speedups ("up to" 1.7x and 1.6x) on two NVIDIA GPU generations, so on its own it does not indicate typical gains across models or how Korch behaves on other hardware. In addition, binary linear programming is NP-hard in general, so solver time may become a concern for very large tensor programs; the paper should be consulted for measured optimization times.

Additionally, this summary does not establish how Korch compares with manually tuned kernels, which may represent a practical upper bound on the achievable gains.

Further research could explore scaling the optimization to larger programs, as well as applying Korch to a wider range of tensor-based workloads, such as distributed training of large-scale machine learning models.

Conclusion

Korch shows that kernel orchestration can be treated as an optimization problem with an optimal solution instead of a sequence of greedy fusion decisions. By combining operator fission with a binary linear programming formulation, it automatically finds orchestration strategies that outperform existing tensor program optimizers on the evaluated DNNs.

By improving the efficiency of tensor programs, Korch can reduce the time and GPU resources that DNN workloads require. Korch is publicly available at https://github.com/humuyan/Korch.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Optimal Kernel Orchestration for Tensor Programs with Korch

Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai

Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7x on V100 GPUs and up to 1.6x on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch.

6/17/2024

Explore as a Storm, Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent

Michael Canesche, Gaurav Verma, Fernando Magno Quintao Pereira

Machine-learning models consist of kernels, which are algorithms applying operations on tensors -- data indexed by a linear combination of natural numbers. Examples of kernels include convolutions, transpositions, and vectorial products. There are many ways to implement a kernel. These implementations form the kernel's optimization space. Kernel scheduling is the problem of finding the best implementation, given an objective function -- typically execution speed. Kernel optimizers such as Ansor, Halide, and AutoTVM solve this problem via search heuristics, which combine two phases: exploration and exploitation. The first step evaluates many different kernel optimization spaces. The latter tries to improve the best implementations by investigating a kernel within the same space. For example, Ansor combines kernel generation through sketches for exploration and leverages an evolutionary algorithm to exploit the best sketches. In this work, we demonstrate the potential to reduce Ansor's search time while enhancing kernel quality by incorporating Droplet Search, an AutoTVM algorithm, into Ansor's exploration phase. The approach involves limiting the number of samples explored by Ansor, selecting the best, and exploiting it with a coordinate descent algorithm. By applying this approach to the first 300 kernels that Ansor generates, we usually obtain better kernels in less time than if we let Ansor analyze 10,000 kernels. This result has been replicated in 20 well-known deep-learning models (AlexNet, ResNet, VGG, DenseNet, etc.) running on four architectures: an AMD Ryzen 7 (x86), an NVIDIA A100 tensor core, an NVIDIA RTX 3080 GPU, and an ARM A64FX. A patch with this combined approach was approved in Ansor in February 2024. As evidence of the generality of this search methodology, a similar patch, achieving equally good results, was submitted to TVM's MetaSchedule in June 2024.

7/16/2024

Global Optimizations & Lightweight Dynamic Logic for Concurrency

Suchita Pati, Shaizeen Aga, Nuwan Jayasena, Matthew D. Sinclair

Modern accelerators like GPUs are increasingly executing independent operations concurrently to improve the device's compute utilization. However, effectively harnessing it on GPUs for important primitives such as general matrix multiplications (GEMMs) remains challenging. Although modern GPUs have significant hardware and software support for GEMMs, their kernel implementations and optimizations typically assume each kernel executes in isolation and can utilize all GPU resources. This approach is highly efficient when kernels execute in isolation, but causes significant resource contention and slowdowns when kernels execute concurrently. Moreover, current approaches often only statically expose and control parallelism within an application, without considering runtime information such as varying input size and concurrent applications -- often exacerbating contention. These issues limit performance benefits from concurrently executing independent operations. Accordingly, we propose GOLDYLOC, which considers the global resources across all concurrent operations to identify performant GEMM kernels, which we call globally optimized (GO)-Kernels. Moreover, GOLDYLOC introduces a lightweight dynamic logic which considers the dynamic execution environment for available parallelism and input sizes to execute performant combinations of concurrent GEMMs on the GPU. Overall, GOLDYLOC improves performance of concurrent GEMMs on a real GPU by up to 2x (18% geomean per workload) and provides up to 2.5x (43% geomean per workload) speedups over sequential execution.

9/5/2024

Optimal Kernel Tuning Parameter Prediction using Deep Sequence Models

Khawir Mahmood, Jehandad Khan, Hammad Afzal

GPU kernels have come to the forefront of computing due to their utility in varied fields, from high-performance computing to machine learning. A typical GPU compute kernel is invoked millions, if not billions of times in a typical application, which makes their performance highly critical. Due to the unknown nature of the optimization surface, an exhaustive search is required to discover the global optimum, which is infeasible due to the possible exponential number of parameter combinations. In this work, we propose a methodology that uses deep sequence-to-sequence models to predict the optimal tuning parameters governing compute kernels. This work considers the prediction of kernel parameters as a sequence to the sequence translation problem, borrowing models from the Natural Language Processing (NLP) domain. Parameters describing the input, output and weight tensors are considered as the input language to the model that emits the corresponding kernel parameters. In essence, the model translates the problem parameter language to kernel parameter language. The core contributions of this work are: a) Proposing that a sequence to sequence model can accurately learn the performance dynamics of a GPU compute kernel b) A novel network architecture which predicts the kernel tuning parameters for GPU kernels, c) A constrained beam search which incorporates the physical limits of the GPU hardware as well as other expert knowledge reducing the search space. The proposed algorithm can achieve more than 90% accuracy on various convolutional kernels in MIOpen, the AMD machine learning primitives library. As a result, the proposed technique can reduce the development time and compute resources required to tune unseen input configurations, resulting in shorter development cycles, reduced development costs, and better user experience.

4/17/2024