NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

2403.00579

Published 4/1/2024 by Guseul Heo, Sangyeop Lee, Jaehong Cho, Hyunmin Choi, Sanghyeon Lee, Hyungkyu Ham, Gwangsun Kim, Divya Mahajan, Jongse Park

cs.AR

Abstract

Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV). Machine learning accelerators like TPUs or NPUs are proficient in handling GEMM but are less efficient for GEMV computations. Conversely, Processing-in-Memory (PIM) technology is tailored for efficient GEMV computation, while it lacks the computational power to handle GEMM effectively. Inspired by this insight, we propose NeuPIMs, a heterogeneous acceleration system that jointly exploits a conventional GEMM-focused NPU and GEMV-optimized PIM devices. The main challenge in efficiently integrating NPU and PIM lies in enabling concurrent operations on both platforms, each addressing a specific kernel type. First, existing PIMs typically operate in a blocked mode, allowing only either NPU or PIM to be active at any given time. Second, the inherent dependencies between GEMM and GEMV in LLMs restrict their parallel processing. To tackle these challenges, NeuPIMs is equipped with dual row buffers in each bank, facilitating the simultaneous management of memory read/write operations and PIM commands. Further, NeuPIMs employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism to allow two independent sub-batches to be pipelined within a single NeuPIMs device. Our evaluation demonstrates that compared to GPU-only, NPU-only, and naive NPU+PIM integrated acceleration approaches, NeuPIMs achieves 3×, 2.4×, and 1.6× throughput improvement, respectively.

Overview

Modern language models are built using transformer neural networks, which consist of multiple decoder blocks.
Each decoder block has three key components: QKV generation, multi-head attention, and feed-forward networks.
In batched processing, QKV generation and the feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV).
Existing machine learning accelerators like TPUs and NPUs handle GEMM efficiently but are less efficient for GEMV computations.
Conversely, Processing-in-Memory (PIM) technology is well-suited for GEMV but lacks the computational power to handle GEMM effectively.
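
The contrast between the two kernel types can be made concrete with a rough arithmetic-intensity estimate (FLOPs per byte moved). The sketch below is illustrative only: the function names, the layer sizes (4096, 128, 1024), and the 2-byte element size are assumptions, not values from the paper, and the model counts only ideal operand traffic while ignoring caches. It relies on one standard observation: in QKV generation and feed-forward layers the weight matrix is shared by every request in the batch, whereas during decoding each request multiplies its query vector against its own cached keys, so batching does not create reuse in attention.

```python
def gemm_intensity(batch: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for Y = X @ W, with X: (batch, d_in) and W: (d_in, d_out).

    W is shared by all requests in the batch, so its traffic is paid once.
    """
    flops = 2 * batch * d_in * d_out
    bytes_moved = bytes_per_elem * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / bytes_moved


def attention_score_intensity(batch: int, seq_len: int, d_head: int,
                              bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for the decoding-time score step q @ K^T.

    Each request has its own K of shape (seq_len, d_head), so the work is
    `batch` independent matrix-vector products with no shared operand.
    """
    flops = 2 * batch * seq_len * d_head
    bytes_moved = bytes_per_elem * batch * (d_head + seq_len * d_head + seq_len)
    return flops / bytes_moved


if __name__ == "__main__":
    for b in (1, 16, 64):
        g = gemm_intensity(b, 4096, 4096)             # hypothetical layer size
        a = attention_score_intensity(b, 1024, 128)   # hypothetical context/head size
        print(f"batch={b:3d}  GEMM: {g:6.2f} FLOPs/byte  attention GEMV: {a:5.2f} FLOPs/byte")
```

Under these assumptions the GEMM intensity grows roughly in proportion to the batch size, while the attention intensity stays below 1 FLOP per byte regardless of batch size, which is why the former favors a compute-rich NPU and the latter favors bandwidth-rich PIM.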

Plain English Explanation

Language models are a type of artificial intelligence that can understand and generate human-like text. They are built using a neural network architecture called transformers, which consists of multiple processing blocks. Each of these blocks has three main components: QKV generation, multi-head attention, and feed-forward networks.

The QKV generation and feed-forward networks require a lot of computational power to perform matrix-matrix multiplications, a type of mathematical operation. Meanwhile, the multi-head attention part needs to move a lot of data around to do matrix-vector multiplications.

Existing specialized hardware like TPUs and NPUs are great at the matrix-matrix multiplications, but struggle with the matrix-vector multiplications. On the other hand, a newer technology called Processing-in-Memory (PIM) is well-suited for the matrix-vector multiplications, but not as powerful for the matrix-matrix ones.

Technical Explanation

The proposed NeuPIMs system aims to address this mismatch by combining a conventional GEMM-focused NPU with GEMV-optimized PIM devices. The key challenges in efficiently integrating these two components are:

Existing PIMs typically operate in a blocked mode, allowing only either the NPU or PIM to be active at a given time.
The dependencies between GEMM and GEMV computations in language models restrict their parallel processing.

To tackle these challenges, NeuPIMs is designed with dual row buffers in each memory bank, enabling simultaneous management of memory operations and PIM commands. Additionally, NeuPIMs employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism to pipeline two independent sub-batches within a single NeuPIMs device.
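
The benefit of sub-batch interleaving can be illustrated with a toy timeline model. This is a minimal sketch under stated assumptions, not the paper's scheduler: each decoder block is reduced to one NPU stage (GEMM) followed by one PIM stage (GEMV), stage times are hypothetical and in arbitrary units, and splitting the batch into sub-batches is assumed to split each stage time evenly (optimistic for GEMM, whose weight reuse shrinks with smaller batches). The function names are made up for illustration.

```python
def serialized_time(layers: int, t_npu: float, t_pim: float) -> float:
    """Blocked execution: the whole batch alternates between NPU and PIM,
    so only one of the two devices is busy at any moment."""
    return layers * (t_npu + t_pim)


def interleaved_time(layers: int, t_npu: float, t_pim: float, n_sub: int = 2) -> float:
    """Interleaved execution: the batch is split into n_sub independent
    sub-batches. Within a sub-batch the stages stay strictly ordered
    (GEMM -> GEMV -> next block's GEMM), but different sub-batches may
    occupy the NPU and the PIM at the same time."""
    npu_step = t_npu / n_sub   # assumed: stage time scales with sub-batch size
    pim_step = t_pim / n_sub
    npu_free = 0.0             # time at which the NPU becomes idle
    pim_free = 0.0             # time at which the PIM becomes idle
    ready = [0.0] * n_sub      # time at which each sub-batch finished its last stage
    for _ in range(layers):
        for s in range(n_sub):                  # GEMM stage on the NPU
            start = max(ready[s], npu_free)
            npu_free = start + npu_step
            ready[s] = npu_free
        for s in range(n_sub):                  # GEMV stage on the PIM
            start = max(ready[s], pim_free)
            pim_free = start + pim_step
            ready[s] = pim_free
    return max(ready)


if __name__ == "__main__":
    layers, t_npu, t_pim = 4, 4.0, 4.0          # hypothetical, balanced stage times
    print("serialized :", serialized_time(layers, t_npu, t_pim))
    print("interleaved:", interleaved_time(layers, t_npu, t_pim))
```

With the balanced stage times above, the serialized schedule takes 32 time units and the interleaved one takes 18, because after the first half-stage one sub-batch is always on the NPU while the other is on the PIM. The gain shrinks when the two stage times are unbalanced, since the slower device becomes the bottleneck.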

Critical Analysis

The paper provides a well-designed solution to the problem of efficiently accelerating language models by combining the strengths of NPUs and PIM devices. However, some potential limitations and areas for further research include:

The evaluation focuses primarily on throughput; other metrics such as power efficiency and latency may also matter for real-world deployment.
The proposed techniques may introduce additional complexity in the system design and memory management, which could impact the overall cost and feasibility.
The generalizability of the NeuPIMs approach to other types of neural networks or applications beyond language models is not explicitly addressed in the paper.

Further research could explore these aspects and investigate the broader applicability of the heterogeneous acceleration principles developed in this work.

Conclusion

The NeuPIMs system accelerates transformer-based language models by combining the complementary strengths of NPUs and PIM devices. By enabling concurrent NPU and PIM execution and exploiting batch parallelism, NeuPIMs is reported to achieve 3×, 2.4×, and 1.6× higher throughput than GPU-only, NPU-only, and naive NPU+PIM approaches, respectively. This work highlights the potential of heterogeneous acceleration systems to address the complex computational requirements of modern large language models.
