AutoTSMM: An Auto-tuning Framework for Building High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on CPUs

Read original: arXiv:2208.08088 - Published 8/19/2024 by Chendi Li, Haipeng Jia, Hang Cao, Jianyu Yao, Boqian Shi, Chunyang Xiang, Jinbo Sun, Pengqi Lu, Yunquan Zhang

🛸

Overview

General matrix-matrix multiplication with non-regular-shaped input matrices is widely used in applications like deep learning.
Conventional implementations are not well-suited for non-regular-shaped matrix-matrix multiplications.
Few works focus on optimizing tall-and-skinny matrix-matrix multiplication on CPUs.

Plain English Explanation

The paper proposes an auto-tuning framework called AutoTSMM to build high-performance tall-and-skinny matrix-matrix multiplication. Tall-and-skinny matrices have more rows than columns. AutoTSMM selects the best inner kernels at install-time and generates an execution plan at runtime for the matrix multiplication.

The experiments show that AutoTSMM achieves competitive performance compared to state-of-the-art tall-and-skinny matrix-matrix multiplication techniques. It also outperforms conventional matrix-matrix multiplication implementations.

Technical Explanation

The paper introduces an auto-tuning framework called AutoTSMM to optimize tall-and-skinny matrix-matrix multiplication on CPUs. Tall-and-skinny matrices have more rows than columns, which is common in applications like deep learning.

AutoTSMM has two stages:

Install-time: It selects the optimal inner kernels for the tall-and-skinny matrix-matrix multiplication.
Runtime: It generates an execution plan for the pre-packed tall-and-skinny matrix-matrix multiplication.

The authors compare the performance of AutoTSMM against state-of-the-art tall-and-skinny matrix-matrix multiplication techniques as well as conventional matrix-matrix multiplication implementations. The results show that AutoTSMM achieves competitive performance and outperforms the conventional approaches.

Critical Analysis

The paper provides a promising solution for optimizing tall-and-skinny matrix-matrix multiplication on CPUs, which is an important operation in many applications. However, the authors do not discuss any potential limitations or caveats of their approach.

For example, it would be helpful to know how AutoTSMM performs on a wider range of matrix sizes and shapes, or how it compares to GPU-based matrix multiplication techniques. Additionally, the paper does not explore the impact of AutoTSMM on power consumption or energy efficiency, which are also crucial factors in real-world deployments.

Further research could investigate the generalization of AutoTSMM to other types of non-regular-shaped matrix operations or the integration of AutoTSMM with higher-level frameworks for end-to-end model optimization.

Conclusion

The paper presents an auto-tuning framework called AutoTSMM that achieves competitive performance for tall-and-skinny matrix-matrix multiplication on CPUs, outperforming conventional approaches. This is an important contribution to optimizing key operations in applications like deep learning. While the paper does not explore all potential limitations, it provides a solid foundation for future research in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🛸

AutoTSMM: An Auto-tuning Framework for Building High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on CPUs

Chendi Li, Haipeng Jia, Hang Cao, Jianyu Yao, Boqian Shi, Chunyang Xiang, Jinbo Sun, Pengqi Lu, Yunquan Zhang

In recent years, general matrix-matrix multiplication with non-regular-shaped input matrices has been widely used in many applications like deep learning and has drawn more and more attention. However, conventional implementations are not suited for non-regular-shaped matrix-matrix multiplications, and few works focus on optimizing tall-and-skinny matrix-matrix multiplication on CPUs. This paper proposes an auto-tuning framework, AutoTSMM, to build high-performance tall-and-skinny matrix-matrix multiplication. AutoTSMM selects the optimal inner kernels in the install-time stage and generates an execution plan for the pre-pack tall-and-skinny matrix-matrix multiplication in the runtime stage. Experiments demonstrate that AutoTSMM achieves competitive performance comparing to state-of-the-art tall-and-skinny matrix-matrix multiplication. And, it outperforms all conventional matrix-matrix multiplication implementations.

8/19/2024

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication

Isuru Ranawaka, Md Taufique Hussain, Charles Block, Gerasimos Gerogiannis, Josep Torrellas, Ariful Azad

We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, called TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms like sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored for TS-SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by incorporating both local and remote computations. On average, our TS-SpGEMM algorithm attains 5x performance gains over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 Nodes (or 65,536 cores) on NERSC Perlmutter.

8/23/2024

An Open-Source Framework for Efficient Numerically-Tailored Computations

Louis Ledoux, Marc Casas

We present a versatile open-source framework designed to facilitate efficient, numerically-tailored Matrix-Matrix Multiplications (MMMs). The framework offers two primary contributions: first, a fine-tuned, automated pipeline for arithmetic datapath generation, enabling highly customizable systolic MMM kernels; second, seamless integration of the generated kernels into user code, irrespective of the programming language employed, without necessitating modifications. The framework demonstrates a systematic enhancement in accuracy per energy cost across diverse High Performance Computing (HPC) workloads displaying a variety of numerical requirements, such as Artificial Intelligence (AI) inference and Sea Surface Height (SSH) computation. For AI inference, we consider a set of state-of-the-art neural network models, namely ResNet18, ResNet34, ResNet50, DenseNet121, DenseNet161, DenseNet169, and VGG11, in conjunction with two datasets, two computer formats, and 27 distinct intermediate arithmetic datapaths. Our approach consistently reduces energy consumption across all cases, with a notable example being the reduction by factors of $3.3times$ for IEEE754-32 and $1.4times$ for Bfloat16 during ImageNet inference with ResNet50. This is accomplished while maintaining accuracies of $82.3%$ and $86%$, comparable to those achieved with conventional Floating-Point Units (FPUs). In the context of SSH computation, our method achieves fully-reproducible results using double-precision words, surpassing the accuracy of conventional double- and quad-precision arithmetic in FPUs. Our approach enhances SSH computation accuracy by a minimum of $5times$ and $27times$ compared to IEEE754-64 and IEEE754-128, respectively, resulting in $5.6times$ and $15.1times$ improvements in accuracy per power cost.

6/6/2024

High Performance Unstructured SpMM Computation Using Tensor Cores

Patrik Okanovic, Grzegorz Kwasniewski, Paolo Sylos Labini, Maciej Besta, Flavio Vella, Torsten Hoefler

High-performance sparse matrix-matrix (SpMM) multiplication is paramount for science and industry, as the ever-increasing sizes of data prohibit using dense data structures. Yet, existing hardware, such as Tensor Cores (TC), is ill-suited for SpMM, as it imposes strict constraints on data structures that cannot be met by unstructured sparsity found in many applications. To address this, we introduce (S)parse (Ma)trix Matrix (T)ensor Core-accelerated (SMaT): a novel SpMM library that utilizes TCs for unstructured sparse matrices. Our block-sparse library leverages the low-level CUDA MMA (matrix-matrix-accumulate) API, maximizing the performance offered by modern GPUs. Algorithmic optimizations such as sparse matrix permutation further improve performance by minimizing the number of non-zero blocks. The evaluation on NVIDIA A100 shows that SMaT outperforms SotA libraries (DASP, cuSPARSE, and Magicube) by up to 125x (on average 2.6x). SMaT can be used to accelerate many workloads in scientific computing, large-model training, inference, and others.

8/22/2024