Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

2405.17322

Published 5/28/2024 by L. A. Torres, Carlos J. Barrios H, Yves Denneulin

Abstract

Matrix multiplication is fundamental to the backpropagation algorithm used to train deep neural network models. Libraries such as Intel's MKL and NVIDIA's cuBLAS implement optimized matrix multiplication techniques that increase performance and reduce computational cost. These techniques can also be implemented in CUDA and SYCL, and in functions that use AVX2 and AVX-512 instructions, which have lower performance but better precision. The study compares execution times and power consumption, measured with PAPI and PERF, and compares accuracy for different matrix sizes. Comparisons were made on third- and fourth-generation Intel CPUs and on NVIDIA V100 and A100 GPUs. The MKL library showed the best performance with a slight loss of precision, while the OpenMP and SYCL implementations on the CPU showed the best accuracy but lower performance. On the GPU, cuBLAS with tensor cores had the best performance, at a cost in accuracy; cuBLAS without these specialized cores showed minimal performance loss and much higher accuracy. The data obtained on different architectures showed that the CPU could achieve performance close to that of the GPU, at the price of increased power consumption. These results depend on hardware specifications such as the number of cores, clock frequency, and processor generation for the CPU, and the speed and bandwidth of the PCI bus and the device architecture (compute capability) for the GPU.

Overview

Evaluates the computational and power performance of different matrix multiplication methods and libraries on both CPU and GPU
Compares Intel's Math Kernel Library (MKL) and NVIDIA's cuBLAS library with implementations written in SYCL, CUDA, and OpenMP
Measures execution time, power consumption, and accuracy for different matrix sizes and hardware configurations

Plain English Explanation

This paper aims to understand the tradeoffs between computational speed and power consumption when performing matrix multiplication on different hardware platforms and using various software libraries. Matrix multiplication is a fundamental operation in many fields, from machine learning to scientific computing, and the choice of implementation can have a significant impact on performance and energy efficiency.

The researchers evaluated several matrix multiplication solutions, including Intel's Math Kernel Library (MKL), NVIDIA's cuBLAS library, and implementations written in SYCL. They ran them on CPU and GPU hardware, measuring computational speed, power consumption, and accuracy for a range of matrix sizes.

The results provide insights into the strengths and weaknesses of each approach. For example, MKL was the fastest option on the CPU but lost a little precision, while cuBLAS with tensor cores was the fastest on the GPU at a cost in accuracy. These tradeoffs can help developers choose the most appropriate solution for their specific needs, whether that's maximizing speed, minimizing power consumption, preserving accuracy, or finding the right balance among the three.

Technical Explanation

The researchers conducted a series of experiments to evaluate the performance and power consumption of different matrix multiplication methods and libraries on both CPU and GPU hardware. The primary solutions were Intel's Math Kernel Library (MKL), NVIDIA's cuBLAS library, and implementations written in SYCL.

The experiments were designed to measure the computational speed, in terms of GFLOPS (Giga Floating-Point Operations per Second), and the power consumption, in Watts, for a range of matrix sizes. The matrix dimensions were varied from small (512x512) to large (4096x4096) to capture the performance characteristics across different workload sizes.
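A GFLOPS figure of this kind is usually derived from a timed run rather than read off the hardware. The sketch below is a minimal illustration in plain Python, assuming the conventional count of 2n^3 floating-point operations for an n x n product. The helper names `matmul` and `gflops` are illustrative and do not come from the paper, and the pure-Python loop is far slower than MKL or cuBLAS.

```python
import time


def matmul(a, b):
    """Naive product of two square matrices stored as lists of lists."""
    # Transpose b once so each inner loop walks two plain sequences.
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def gflops(n, seconds):
    """Throughput of an n x n product, assuming 2 * n**3 operations."""
    return 2 * n**3 / seconds / 1e9


if __name__ == "__main__":
    n = 64  # deliberately tiny; the loop is O(n**3) in interpreted code
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.4f} s, {gflops(n, elapsed):.4f} GFLOPS")
```

The same `gflops` arithmetic applies unchanged when the timed call is a library routine instead of the naive loop.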

The CPU experiments ran on third- and fourth-generation Intel processors, and the GPU experiments on NVIDIA V100 and A100 GPUs. Execution time and power consumption were collected with PAPI and PERF.
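On Linux, one software route to CPU energy readings is the set of RAPL counters exposed through the powercap sysfs interface, the same kind of counter that tools such as PAPI and perf can report. The sketch below assumes the package-0 domain lives at `/sys/class/powercap/intel-rapl:0` and that the process may read it; both vary by machine. The function names are hypothetical, and this is not the authors' measurement code.

```python
import time
from pathlib import Path

# Assumed location of the package-0 RAPL domain; it differs between systems
# and reading it often requires elevated privileges.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(name):
    """Read one integer value (microjoules) from the RAPL domain directory."""
    return int((RAPL_DIR / name).read_text())


def energy_delta_uj(before, after, max_range):
    """Energy between two counter readings, allowing for a single wraparound."""
    if after >= before:
        return after - before
    return after + max_range - before


def measure(workload):
    """Run workload() and return (seconds, joules, average watts)."""
    max_range = read_uj("max_energy_range_uj")
    e0 = read_uj("energy_uj")
    t0 = time.perf_counter()
    workload()
    seconds = time.perf_counter() - t0
    e1 = read_uj("energy_uj")
    joules = energy_delta_uj(e0, e1, max_range) / 1e6
    return seconds, joules, joules / seconds


if __name__ == "__main__":
    try:
        s, j, w = measure(lambda: sum(i * i for i in range(5_000_000)))
        print(f"{s:.3f} s, {j:.3f} J, {w:.1f} W average")
    except OSError as exc:
        print(f"RAPL counters not readable here: {exc}")
```

The counter covers the whole CPU package, so other activity on the machine is included in the reading; short workloads are best repeated to average that out.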

The results showed that, on the CPU, MKL delivered the best performance with a slight loss of precision, while the OpenMP and SYCL implementations were the most accurate but slower. On the GPU, cuBLAS with tensor cores was the fastest but paid for it in accuracy; cuBLAS without tensor cores gave up little performance and was much more accurate. Across architectures, the CPU could approach the performance of the GPU, but only with increased power consumption.
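The paper's abstract also reports accuracy differences between implementations. One way such a comparison might be set up is to compute the same product at reduced precision and against a higher-precision reference, then look at the largest elementwise gap. The sketch below is an assumption about methodology, not the authors' procedure: it emulates binary32 arithmetic by rounding through the standard `struct` module and uses Python's binary64 floats as the reference.

```python
import random
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f64(xs, ys):
    """Reference dot product in binary64."""
    return sum(x * y for x, y in zip(xs, ys))


def dot_f32(xs, ys):
    """Dot product with inputs, products, and partial sums rounded to binary32."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = to_f32(acc + to_f32(to_f32(x) * to_f32(y)))
    return acc


def max_abs_error(n, seed=0):
    """Largest elementwise gap between the binary32 and binary64 products
    of two random n x n matrices with entries in [-1, 1]."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    bt = list(zip(*b))
    return max(abs(dot_f32(r, c) - dot_f64(r, c)) for r in a for c in bt)


if __name__ == "__main__":
    for n in (8, 32, 64):
        print(f"n={n}: max abs error {max_abs_error(n):.3e}")
```

Real libraries differ in accumulation order and in the precision of intermediate sums, so their errors will not match this emulation; the sketch only shows the shape of the comparison.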

These findings highlight the importance of considering both computational performance and power consumption when choosing the appropriate matrix multiplication solution for a given application. The trade-offs between speed and energy efficiency can be critical in domains such as energy-constrained embedded systems or large-scale data centers, where both computational capability and power efficiency are crucial.
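The speed versus energy trade-off can be made concrete with two derived quantities: energy to solution (average power times run time) and GFLOPS per watt. The sketch below uses a hypothetical `Run` record and placeholder numbers chosen only to exercise the arithmetic; they are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str       # label for the configuration
    n: int          # matrix dimension (n x n)
    seconds: float  # wall-clock time of the product
    watts: float    # average power over the run

    @property
    def gflops(self):
        # Conventional 2 * n**3 operation count for an n x n product.
        return 2 * self.n**3 / self.seconds / 1e9

    @property
    def joules(self):
        # Energy to solution: average power multiplied by run time.
        return self.watts * self.seconds

    @property
    def gflops_per_watt(self):
        return self.gflops / self.watts


# Placeholder values, not results from the paper.
runs = [
    Run("config A", 4096, 0.50, 200.0),
    Run("config B", 4096, 0.25, 500.0),
]
fastest = min(runs, key=lambda r: r.seconds)
most_frugal = min(runs, key=lambda r: r.joules)

if __name__ == "__main__":
    for r in runs:
        print(f"{r.name}: {r.gflops:.1f} GFLOPS, {r.joules:.1f} J, "
              f"{r.gflops_per_watt:.2f} GFLOPS/W")
    print("fastest:", fastest.name, "| lowest energy:", most_frugal.name)
```

With these made-up inputs the faster configuration is not the one that uses the least energy, which is the kind of tension the summary describes.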

Critical Analysis

The paper provides a comprehensive evaluation of matrix multiplication performance and power consumption across different hardware and software configurations. However, it is important to note that the results may be specific to the particular hardware and software versions used in the experiments.

As the researchers note, the results are conditional on hardware specifications such as the number of cores, clock frequency, and processor generation for the CPU, and the speed and bandwidth of the PCI bus and the compute capability for the GPU. The findings may therefore not generalize to all hardware and software combinations, and users should evaluate the performance, power, and accuracy characteristics of these libraries in their own environments.

Additionally, the paper does not explore the potential impact of advanced techniques, such as GPU memory management optimizations or stencil computation parallelization, which could further improve the performance and energy efficiency of matrix multiplication on GPU and CPU platforms.

Conclusion

This paper provides a valuable comparison of the computational and power performance of different matrix multiplication methods and libraries on both CPU and GPU hardware. The results highlight the trade-offs between speed and energy efficiency that developers and researchers must consider when choosing the appropriate matrix multiplication solution for their applications.

The findings can help guide the selection of matrix multiplication implementations based on the specific requirements of a given use case, whether it's maximizing computational throughput or minimizing power consumption. This knowledge can be particularly useful in domains like machine learning, scientific computing, and energy-constrained embedded systems, where both performance and energy efficiency are critical.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper to verify details.
