GPU Implementations for Midsize Integer Addition and Multiplication

Read original: arXiv:2405.14642 - Published 5/24/2024 by Cosmin E. Oancea, Stephen M. Watt

📊

Overview

This paper explores using a high-level functional language for efficient GPU-based arithmetic on "midsize" integers, up to around a quarter million bits.
The goal is to understand if it's possible to support effective nested-parallel programs with a small, flexible codebase.
The researchers present GPU implementations for integer addition and multiplication that leverage temporal reuse from scratchpad memories.
They find that their solutions are simpler than related work, which often uses complex tiling strategies or computations in the domain of real numbers.
The researchers evaluate performance against the state-of-the-art CGBN library and find their CUDA prototype outperforms CGBN for integer sizes above 32K bits.

Plain English Explanation

In this paper, the researchers explore using a high-level programming language called a "functional language" to perform arithmetic operations on large integers, up to around a quarter million bits, on graphics processing units (GPUs). The main goal is to see if they can create efficient programs that can take advantage of the parallel processing power of GPUs, while keeping the underlying code simple and flexible.

The researchers focus on implementing two key operations: addition and multiplication of these large integers. They find ways to organize the work so that the GPU's high-speed temporary memory, called "scratchpad memory," is used effectively, which boosts performance. Compared to previous work, the researchers' solutions are much simpler, avoiding complex strategies that may be "overkill" for this problem.

The researchers test their GPU-based implementations against a state-of-the-art library called CGBN, which is optimized for large integer arithmetic. They find that their approach outperforms CGBN for integers larger than 32,000 bits, while providing comparable performance for smaller sizes. Interestingly, they also discover that a particular multiplication technique called FFT multiplication starts to outperform the more traditional multiplication approach as the integer sizes get larger, as long as the integers still fit in the GPU's memory.

Finally, the researchers examine the strengths and weaknesses of the high-level functional language they used, called Futhark, for efficiently supporting these types of computational tasks. They identify that a new compiler feature to better manage excess parallelism could significantly improve the performance.

Technical Explanation

The researchers present GPU-based implementations for addition and multiplication of "midsize" integers, which they define as up to around a quarter million bits. Their goal is to understand whether it is possible to support efficient nested-parallel programs using a small, flexible codebase.

For integer addition, the researchers recognize that this can be implemented efficiently on GPUs using a well-known technique called "scan," which allows the computation to be parallelized effectively. Previous work has shown the potential for scan-based addition on GPUs.

For integer multiplication, the researchers employ a simple work-partitioning strategy that provides good temporal locality, allowing the GPU's scratchpad memory to be used effectively. This is in contrast to related work that uses complex tiling strategies, which the researchers feel may be "too big a hammer for the job."

The researchers also explore FFT-based multiplication, where they efficiently map the computation to the domain of integral fields by finding "good" prime numbers that enable near-full utilization of machine words. This avoids the potential issues with working in the domain of real numbers, which could degrade the numerical precision.

The researchers evaluate the performance of their CUDA-based prototype implementations against the state-of-the-art CGBN library, developed by NvidiaLab. They find that their approach outperforms CGBN for integer sizes above 32K bits, while offering comparable performance for smaller sizes. Importantly, they also report that the FFT-based multiplication outperforms the classical multiplication approach on the larger integer sizes that still fit within a single CUDA block.

Critical Analysis

The researchers have presented a thoughtful and novel approach to achieving efficient GPU-based arithmetic on "midsize" integers using a high-level functional language. Their key contributions lie in the simplicity of their solutions, which avoid the complex tiling strategies employed in related work, as well as their use of the integral field domain for FFT-based multiplication.

One potential limitation of the research is the focus on integers that fit within a single CUDA block. While this is a reasonable constraint for "midsize" integers, it may not fully address the challenges of working with truly large integers that exceed the memory capacity of a single GPU. Exploring techniques for scaling the computations across multiple GPU blocks or even multiple GPUs could be an area for further research.

Additionally, the researchers acknowledge that the high-level functional language they used, Futhark, could benefit from a compiler pass to more efficiently manage excess parallelism. This suggests that the performance of their approach may be somewhat constrained by the capabilities of the language and compiler, rather than solely by the underlying algorithmic and architectural choices.

It would also be interesting to see how the researchers' techniques compare to other state-of-the-art libraries or approaches beyond just CGBN, as well as how the performance scales with the number of GPU cores or other hardware parameters.

Overall, the researchers have presented a valuable contribution to the field of GPU-accelerated arithmetic on large integers, and their work serves as a solid foundation for further exploration and optimization in this domain.

Conclusion

This paper explores the use of a high-level functional language for efficient GPU-based arithmetic on "midsize" integers, up to around a quarter million bits. The researchers present simple yet effective implementations for integer addition and multiplication that leverage the GPU's scratchpad memory to achieve good performance.

The key findings are that the researchers' CUDA-based prototype outperforms the state-of-the-art CGBN library for integer sizes above 32K bits, and that FFT-based multiplication can outperform the classical approach for the larger integer sizes that still fit within a single CUDA block.

While the focus on integers that fit within a single CUDA block is a reasonable constraint, exploring techniques for scaling the computations across multiple GPU blocks or even multiple GPUs could be an area for future research. Additionally, the researchers identify opportunities to improve the performance of their approach through compiler-level optimizations.

Overall, this work contributes valuable insights and techniques for leveraging the parallel processing power of GPUs for large-integer arithmetic, which has important applications in fields such as cryptography, scientific computing, and data analysis.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

📊

GPU Implementations for Midsize Integer Addition and Multiplication

Cosmin E. Oancea, Stephen M. Watt

This paper explores practical aspects of using a high-level functional language for GPU-based arithmetic on ``midsize'' integers. By this we mean integers of up to about a quarter million bits, which is sufficient for most practical purposes. The goal is to understand whether it is possible to support efficient nested-parallel programs with a small, flexible code base. We report on GPU implementations for addition and multiplication of integers that fit in one CUDA block, thus leveraging temporal reuse from scratchpad memories. Our key contribution resides in the simplicity of the proposed solutions: We recognize that addition is a straightforward application of scan, which is known to allow efficient GPU implementation. For quadratic multiplication we employ a simple work-partitioning strategy that offers good temporal locality. For FFT multiplication, we efficiently map the computation in the domain of integral fields by finding ``good'' primes that enable almost-full utilization of machine words. In comparison, related work uses complex tiling strategies -- which feel too big a hammer for the job -- or uses the computational domain of reals, which may degrade the magnitude of the base in which the computation is carried. We evaluate the performance in comparison to the state-of-the-art CGBN library, authored by NvidiaLab, and report that our CUDA prototype outperforms CGBN for integer sizes higher than 32K bits, while offering comparable performance for smaller sizes. Moreover, we are, to our knowledge, the first to report that FFT multiplication outperforms the classical one on the larger sizes that still fit in a CUDA block. Finally, we examine Futhark's strengths and weaknesses for efficiently supporting such computations and find out that a compiler pass aimed at efficient sequentialization of excess parallelism would significantly improve performance.

5/24/2024

🔗

Fast multiplication by two's complement addition of numbers represented as a set of polynomial radix 2 indexes, stored as an integer list for massively parallel computation

Mark Stocks

We demonstrate a multiplication method based on numbers represented as set of polynomial radix 2 indices stored as an integer list. The 'polynomial integer index multiplication' method is a set of algorithms implemented in python code. We demonstrate the method to be faster than both the Number Theoretic Transform (NTT) and Karatsuba for multiplication within a certain bit range. Also implemented in python code for comparison purposes with the polynomial radix 2 integer method. We demonstrate that it is possible to express any integer or real number as a list of integer indices, representing a finite series in base two. The finite series of integer index representation of a number can then be stored and distributed across multiple CPUs / GPUs. We show that operations of addition and multiplication can be applied as two's complement additions operating on the index integer representations and can be fully distributed across a given CPU / GPU architecture. We demonstrate fully distributed arithmetic operations such that the 'polynomial integer index multiplication' method overcomes the current limitation of parallel multiplication methods. Ie, the need to share common core memory and common disk for the calculation of results and intermediate results.

7/30/2024

Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

L. A. Torres, Carlos J. Barrios H, Yves Denneulin

Matrix multiplication is fundamental in the backpropagation algorithm used to train deep neural network models. Libraries like Intel's MKL or NVIDIA's cuBLAS implemented new and optimized matrix multiplication techniques that increase performance and reduce computational costs. These techniques can also be implemented in CUDA and SYCL and functions with AVX2 and AVX512 instructions, which have lower performance but better precision. The study compares execution times and power consumption using PAPI and PERF and compares accuracy for different matrix sizes. Comparisons were made on architectures such as third and fourth-generation Intel CPUs and NVIDIA V100 and A100 GPUs. The MKL library showed the best performance with a slight loss of precision, while OpenMP and SYCL on the CPU implementation showed the best accuracy but a loss of performance. On the other hand, the results on GPU showed that cuBLAS with tensor cores had the best performance; however, it had a cost in accuracy. The cuBLAS library without these specialized cores shows minimal performance loss and much higher accuracy. The data obtained on different architectures showed that the CPU could achieve performance close to that obtained on the GPU with increased power consumption. These results are conditional on certain hardware specifications, such as the number of cores, clock frequency, processor generation for the CPU, and the speed and bandwidth of the PCI bus and device architecture (compute capability) for the GPU.

5/28/2024

Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA

Xuqi Zhu, Huaizhi Zhang, JunKyu Lee, Jiacheng Zhu, Chandrajit Pal, Sangeet Saha, Klaus D. McDonald-Maier, Xiaojun Zhai

Modern Neural Network (NN) architectures heavily rely on vast numbers of multiply-accumulate arithmetic operations, constituting the predominant computational cost. Therefore, this paper proposes a high-throughput, scalable and energy efficient non-element-wise matrix multiplication unit on FPGAs as a basic component of the NNs. We firstly streamline inter-layer and intra-layer redundancies of MADDNESS algorithm, a LUT-based approximate matrix multiplication, to design a fast, efficient scalable approximate matrix multiplication module termed Approximate Multiplication Unit (AMU). The AMU optimizes LUT-based matrix multiplications further through dedicated memory management and access design, decoupling computational overhead from input resolution and boosting FPGA-based NN accelerator efficiency significantly. The experimental results show that using our AMU achieves up to 9x higher throughput and 112x higher energy efficiency over the state-of-the-art solutions for the FPGA-based Quantised Neural Network (QNN) accelerators.

7/9/2024