Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

Read original: arXiv:2405.12525 - Published 5/24/2024 by Dane C. Lacey, Christie L. Alappat, Florian Lange, Georg Hager, Holger Fehske, Gerhard Wellein

🏋️

Overview

Sparse matrix-vector products (SpMVs) are a bottleneck in many scientific codes due to the strain on memory and irregular access patterns
Repeating these products multiple times with the same matrix is required in many algorithms, creating an opportunity for data reuse
The Recursive Algebraic Coloring Engine (RACE) has successfully exploited this opportunity for shared memory parallelization, but enabling cache blocking for distributed-memory parallelization is challenging
This work proposes a flexible method that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme to enable distributed-memory parallelization of the matrix power kernel (MPK)

Plain English Explanation

Sparse matrix-vector products are a common operation in many scientific simulations and algorithms, but they can be slow because the way the data is stored in memory doesn't match how the computations need to access it. This causes a lot of strain on the memory system.

Many of these algorithms need to repeat the sparse matrix-vector product multiple times using the same matrix. This means the same matrix data is being loaded from memory over and over, which presents an opportunity to reuse that data and make the overall computation more efficient.

A previous technique called the Recursive Algebraic Coloring Engine (RACE) has been able to take advantage of this data reuse for computations running on a single machine with shared memory. However, extending this to computations running across multiple machines with separate memories (distributed memory) is challenging, because the machines need to coordinate and share the data they each need.

This new work proposes a flexible method that combines RACE's cache-blocking techniques with a communication scheme using the Message Passing Interface (MPI) to enable distributed-memory parallelization of the matrix power kernel. This allows them to achieve substantial speed-ups on modern hardware compared to traditional distributed-memory approaches.

Technical Explanation

The authors propose a new method for parallelizing the matrix power kernel (MPK), which involves repeatedly performing sparse matrix-vector products (SpMVs) with the same sparse matrix. Previous work has shown that RACE, a technique for exploiting data reuse in SpMVs, can provide significant performance improvements in shared-memory parallelization.

However, extending RACE to distributed-memory parallelization is challenging due to the need for explicit communication and synchronization of data between compute nodes. The authors address this by interleaving RACE's cache-blocking capabilities with an MPI communication scheme that fulfills the data dependencies between processes.

Their "Distributed Level-Blocked MPK" method partitions the matrix into levels that can be processed independently, with communication only required between neighboring levels. This allows for efficient cache utilization and data reuse within each process, while also enabling the necessary data exchanges between processes.

The authors evaluate their method on a variety of sparse matrices from scientific applications, demonstrating substantial speed-ups over traditional distributed-memory parallel MPK approaches on modern Intel and AMD architectures. They also apply their technique to a quantum physics problem, achieving up to a 4x speed-up on 832 cores of an Intel Sapphire Rapids cluster.

Critical Analysis

The authors have successfully addressed a key challenge in enabling distributed-memory parallelization of the matrix power kernel, which is an important operation in many scientific algorithms. By combining RACE's cache-blocking techniques with an MPI-based communication scheme, they have created a flexible and efficient method that can take advantage of data reuse while also handling the required data dependencies across compute nodes.

One potential limitation of the work is that the communication and synchronization overhead may still be significant, especially for matrices with irregular sparsity patterns that result in unbalanced workloads across processes. The authors acknowledge this and suggest further research into dynamic load balancing strategies.

Additionally, the performance of the method may be sensitive to the specific characteristics of the target hardware, such as the memory hierarchy and network topology. The authors have evaluated their approach on a range of modern Intel and AMD systems, but it would be valuable to see how it scales on other architectures, such as GPU-accelerated or specialized hardware.

Overall, this work represents an important step forward in enabling efficient distributed-memory parallelization of the matrix power kernel, which could have significant implications for a wide range of scientific computing applications.

Conclusion

This paper presents a flexible and efficient method for parallelizing the matrix power kernel (MPK) on distributed-memory systems. By combining the cache-blocking capabilities of the Recursive Algebraic Coloring Engine (RACE) with an MPI-based communication scheme, the authors have addressed a key challenge in enabling data reuse across compute nodes.

The proposed "Distributed Level-Blocked MPK" approach has been shown to deliver substantial performance improvements over traditional distributed-memory parallel MPK methods, with speed-ups of up to 4x demonstrated on a quantum physics problem. This work could have far-reaching impacts, as the MPK is a crucial operation in many scientific algorithms and simulations.

While the method has some limitations, such as potential sensitivity to irregular sparsity patterns and hardware characteristics, the authors have laid the groundwork for further research and optimization. As high-performance computing continues to evolve, techniques like this that can efficiently leverage both shared and distributed-memory parallelism will become increasingly important for advancing the state of the art in scientific computing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🏋️

Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

Dane C. Lacey, Christie L. Alappat, Florian Lange, Georg Hager, Holger Fehske, Gerhard Wellein

Sparse matrix-vector products (SpMVs) are a bottleneck in many scientific codes. Due to the heavy strain on the main memory interface from loading the sparse matrix and the possibly irregular memory access pattern, SpMV typically exhibits low arithmetic intensity. Repeating these products multiple times with the same matrix is required in many algorithms. This so-called matrix power kernel (MPK) provides an opportunity for data reuse since the same matrix data is loaded from main memory multiple times, an opportunity that has only recently been exploited successfully with the Recursive Algebraic Coloring Engine (RACE). Using RACE, one considers a graph based formulation of the SpMV and employs s level-based implementation of SpMV for reuse of relevant matrix data. However, the underlying data dependencies have restricted the use of this concept to shared memory parallelization and thus to single compute nodes. Enabling cache blocking for distributed-memory parallelization of MPK is challenging due to the need for explicit communication and synchronization of data in neighboring levels. In this work, we propose and implement a flexible method that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme that fulfills all data dependencies among processes. Compared to a traditional distributed memory parallel MPK, our new Distributed Level-Blocked MPK yields substantial speed-ups on modern Intel and AMD architectures across a wide range of sparse matrices from various scientific applications. Finally, we address a modern quantum physics problem to demonstrate the applicability of our method, achieving a speed-up of up to 4x on 832 cores of an Intel Sapphire Rapids cluster.

5/24/2024

🗣️

SABLE: Staging Blocked Evaluation of Sparse Matrix Computations

Pratyush Das, Adhitha Dias, Anxhelo Xhebraj, Artem Pelenitsyn, Kirshanthan Sundararajah, Milind Kulkarni

Sparse Matrices found in the real world often have some structure in how the dense elements are organized. While the inspector-executor model inspects matrices for structure, its generality can overlook further specialization. We propose a system that - if the sparse matrix is stored in a blocked storage format - can generate more efficient code by constructing regular loops over these blocks. Our system performs a specified computation over every element of the block instead of avoiding computing any sparse element at all and achieving regularity in specialized code. The system is extensible, providing a dense block iterator for the user to express any computation over these dense blocks. We show that this approach can significantly speed up SpMV and SpMM operations over the state-of-the-art systems Partially-Strided Codelets and Sparse Register Tiling.

7/2/2024

PARS3: Parallel Sparse Skew-Symmetric Matrix-Vector Multiplication with Reverse Cuthill-McKee Reordering

Selin Yildirim, Murat Manguoglu

Sparse matrices, as prevalent primitive of various scientific computing algorithms, persist as a bottleneck in processing. A skew-symmetric matrix flips signs of symmetric pairs in a symmetric matrix. Our work, Parallel 3-Way Banded Skew-Symmetric Sparse Matrix-Vector Multiplication, equally improves parallel symmetric SpMV kernels with a different perspective than the common literature trends, by manipulating the form of matrix in a preprocessing step to accelerate the repeated computations of iterative solvers. We effectively use Reverse Cuthill-McKee (RCM) reordering algorithm to transform a sparse skew-symmetrix matrix into a band matrix, then efficiently parallelize it by splitting the band structure into 3 different parts by considering its local sparsity. Our proposed method with RCM is novel in the sense that it is the first implementation of parallel skew-symmetric SpMV kernels. Our enhancements in SpMV and findings are valuable with significant strong scalings of up to 19x over the serial compressed SpMV implementation. We overperform a heuristic-based graph-coloring approach with synchronization phases in implementing parallel symmetric SpMVs. Our approach also naturally applies to parallel sparse symmetric SpMVs, that can inspire widespread SpMV solutions to adapt presented optimizations in this paper.

7/26/2024

❗

Fast multiplication of random dense matrices with fixed sparse matrices

Tianyu Liang, Riley Murray, Ayd{i}n Buluc{c}, James Demmel

This work focuses on accelerating the multiplication of a dense random matrix with a (fixed) sparse matrix, which is frequently used in sketching algorithms. We develop a novel scheme that takes advantage of blocking and recomputation (on-the-fly random number generation) to accelerate this operation. The techniques we propose decrease memory movement, thereby increasing the algorithm's parallel scalability in shared memory architectures. On the Intel Frontera architecture, our algorithm can achieve 2x speedups over libraries such as Eigen and Intel MKL on some examples. In addition, with 32 threads, we can obtain a parallel efficiency of up to approximately 45%. We also present a theoretical analysis for the memory movement lower bound of our algorithm, showing that under mild assumptions, it's possible to beat the data movement lower bound of general matrix-matrix multiply (GEMM) by a factor of $sqrt M$, where $M$ is the cache size. Finally, we incorporate our sketching algorithm into a randomized least squares solver. For extremely over-determined sparse input matrices, we show that our results are competitive with SuiteSparse; in some cases, we obtain a speedup of 10x over SuiteSparse.

5/14/2024