PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices

2404.08871

Published 4/16/2024 by Si Ung Noh, Junguk Hong, Chaemin Lim, Seongyeon Park, Jeehyun Kim, Hanjun Kim, Youngsok Kim, Jinho Lee

cs.DC cs.AR

PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices

Abstract

Recent dual in-line memory modules (DIMMs) are starting to support processing-in-memory (PIM) by associating their memory banks with processing elements (PEs), allowing applications to overcome the data movement bottleneck by offloading memory-intensive operations to the PEs. Many highly parallel applications have been shown to benefit from these PIM-enabled DIMMs, but further speedup is often limited by the huge overhead of inter-PE communication. This mainly comes from the slow CPU-mediated inter-PE communication methods which incurs significant performance overheads, making it difficult for PIM-enabled DIMMs to accelerate a wider range of applications. Prior studies have tried to alleviate the communication bottleneck, but they lack enough flexibility and performance to be used for a wide range of applications. In this paper, we present PID-Comm, a fast and flexible collective inter-PE communication framework for commodity PIM-enabled DIMMs. The key idea of PID-Comm is to abstract the PEs as a multi-dimensional hypercube and allow multiple instances of collective inter-PE communication between the PEs belonging to certain dimensions of the hypercube. Leveraging this abstraction, PID-Comm first defines eight collective inter-PE communication patterns that allow applications to easily express their complex communication patterns. Then, PID-Comm provides high-performance implementations of the collective inter-PE communication patterns optimized for the DIMMs. Our evaluation using 16 UPMEM DIMMs and representative parallel algorithms shows that PID-Comm greatly improves the performance by up to 4.20x compared to the existing inter-PE communication implementations. The implementation of PID-Comm is available at https://github.com/AIS-SNU/PID-Comm.

Get summaries of the top AI research delivered straight to your inbox:

Overview

Presents a new framework called "\thiswork" for efficient collective communication on commodity processing-in-DIMM (Dual Inline Memory Module) devices
Aims to enable fast and flexible communication between processing units within DRAM (Dynamic Random Access Memory) modules
Leverages the unique characteristics of processing-in-DIMM architectures to optimize collective communication operations

Plain English Explanation

The paper introduces a new framework called "\thiswork" that is designed to enable efficient collective communication on processing-in-DIMM devices. These are specialized DRAM modules that have processing capabilities built into them, allowing some computations to be performed directly within the memory itself.

The key idea behind \thiswork is to take advantage of the unique features of processing-in-DIMM architectures to optimize collective communication operations. Collective communication refers to the exchange of information between multiple processing units, such as performing a sum or finding the maximum value across a group of nodes. By designing a communication framework tailored to the processing-in-DIMM setting, the researchers aim to enable faster and more flexible collective communication compared to traditional approaches.

The paper explores how the \thiswork framework can be used to accelerate a variety of collective communication primitives, such as reduce, broadcast, and all-to-all operations. By leveraging the processing capabilities within the DRAM modules, the \thiswork framework aims to achieve improved performance and flexibility compared to traditional approaches that rely on the CPU (Central Processing Unit) or external accelerators for these communication tasks.

Technical Explanation

The paper first provides background on processing-in-DIMM architectures, which integrate processing units directly into DRAM modules. This allows certain computations to be performed within the memory itself, reducing the need to transfer data to and from the CPU. The authors then present the \thiswork framework, which is designed to enable efficient collective communication on these processing-in-DIMM devices.

The \thiswork framework includes several key components:

Communication Primitives: \thiswork supports a range of collective communication primitives, such as reduce, broadcast, and all-to-all operations. These are implemented using specialized hardware and software within the processing-in-DIMM devices.
Flexible Mapping: \thiswork provides a flexible mapping mechanism that allows the collective communication primitives to be efficiently mapped to the underlying processing-in-DIMM hardware. This includes support for different memory channel configurations and the ability to leverage the parallel processing capabilities of the DRAM modules.
Optimization Techniques: The paper explores various optimization techniques to improve the performance of \thiswork, such as more scalable sparse dynamic data exchange and priority-driven accelerator management.

The authors evaluate the \thiswork framework through extensive experiments, comparing its performance to traditional approaches for collective communication. The results demonstrate the benefits of \thiswork in terms of latency, throughput, and flexibility when working with processing-in-DIMM devices.

Critical Analysis

The paper presents a well-designed and thoroughly evaluated framework for collective communication on processing-in-DIMM devices. The authors have clearly identified the unique challenges and opportunities presented by these emerging architectures and have developed a solution that effectively leverages the processing capabilities within the DRAM modules.

One potential limitation of the \thiswork framework is that it may be specific to the particular processing-in-DIMM hardware used in the experiments. The authors mention that \thiswork is designed to be flexible and adaptable, but further research may be needed to understand how well it would perform on a wider range of processing-in-DIMM architectures.

Additionally, the paper does not explore the energy efficiency or power consumption of the \thiswork framework, which could be an important consideration for real-world deployments. Further analysis of the energy and power characteristics of \thiswork would be valuable.

Overall, the \thiswork framework represents a significant contribution to the field of collective communication in the context of processing-in-DIMM devices. The paper's thorough evaluation and clear explanations make it a valuable resource for researchers and developers working in this area.

Conclusion

The \thiswork framework presented in this paper offers a novel approach to enabling efficient collective communication on commodity processing-in-DIMM devices. By leveraging the unique capabilities of these emerging memory architectures, \thiswork aims to achieve faster and more flexible communication between processing units within DRAM modules.

The paper's technical explanations and experimental results demonstrate the potential of \thiswork to accelerate a variety of collective communication primitives, such as reduce, broadcast, and all-to-all operations. The framework's flexibility and optimization techniques make it a promising solution for researchers and developers working to harness the power of processing-in-DIMM architectures.

As processing-in-DIMM technologies continue to evolve, the insights and approaches presented in this paper will likely become increasingly valuable for advancing the state of the art in memory-centric computing and enabling more efficient data-intensive applications.

Related Papers

Analysis of Distributed Optimization Algorithms on a Real Processing-In-Memory System

Steve Rhyner, Haocong Luo, Juan G'omez-Luna, Mohammad Sadrosadati, Jiawei Jiang, Ataberk Olgun, Harshita Gupta, Ce Zhang, Onur Mutlu

Machine Learning (ML) training on large-scale datasets is a very expensive and time-consuming workload. Processor-centric architectures (e.g., CPU, GPU) commonly used for modern ML training workloads are limited by the data movement bottleneck, i.e., due to repeatedly accessing the training dataset. As a result, processor-centric systems suffer from performance degradation and high energy consumption. Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck by placing the computation mechanisms inside or near memory. Our goal is to understand the capabilities and characteristics of popular distributed optimization algorithms on real-world PIM architectures to accelerate data-intensive ML training workloads. To this end, we 1) implement several representative centralized distributed optimization algorithms on UPMEM's real-world general-purpose PIM system, 2) rigorously evaluate these algorithms for ML training on large-scale datasets in terms of performance, accuracy, and scalability, 3) compare to conventional CPU and GPU baselines, and 4) discuss implications for future PIM hardware and the need to shift to an algorithm-hardware codesign perspective to accommodate decentralized distributed optimization algorithms. Our results demonstrate three major findings: 1) Modern general-purpose PIM architectures can be a viable alternative to state-of-the-art CPUs and GPUs for many memory-bound ML training workloads, when operations and datatypes are natively supported by PIM hardware, 2) the importance of carefully choosing the optimization algorithm that best fit PIM, and 3) contrary to popular belief, contemporary PIM architectures do not scale approximately linearly with the number of nodes for many data-intensive ML training workloads. To facilitate future research, we aim to open-source our complete codebase.

4/11/2024

cs.AR cs.AI cs.DC cs.LG

✅

NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

Guseul Heo, Sangyeop Lee, Jaehong Cho, Hyunmin Choi, Sanghyeon Lee, Hyungkyu Ham, Gwangsun Kim, Divya Mahajan, Jongse Park

Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV). Machine learning accelerators like TPUs or NPUs are proficient in handling GEMM but are less efficient for GEMV computations. Conversely, Processing-in-Memory (PIM) technology is tailored for efficient GEMV computation, while it lacks the computational power to handle GEMM effectively. Inspired by this insight, we propose NeuPIMs, a heterogeneous acceleration system that jointly exploits a conventional GEMM-focused NPU and GEMV-optimized PIM devices. The main challenge in efficiently integrating NPU and PIM lies in enabling concurrent operations on both platforms, each addressing a specific kernel type. First, existing PIMs typically operate in a blocked mode, allowing only either NPU or PIM to be active at any given time. Second, the inherent dependencies between GEMM and GEMV in LLMs restrict their parallel processing. To tackle these challenges, NeuPIMs is equipped with dual row buffers in each bank, facilitating the simultaneous management of memory read/write operations and PIM commands. Further, NeuPIMs employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism to allow two independent sub-batches to be pipelined within a single NeuPIMs device. Our evaluation demonstrates that compared to GPU-only, NPU-only, and a naive NPU+PIM integrated acceleration approaches, NeuPIMs achieves 3$times$, 2.4$times$ and 1.6$times$ throughput improvement, respectively.

4/1/2024

cs.AR

🔄

On Error Correction for Nonvolatile Processing-In-Memory

Husrev C{i}lasun, Salonik Resch, Zamshed I. Chowdhury, Masoud Zabihi, Yang Lv, Brandon Zink, Jian-Ping Wang, Sachin S. Sapatnekar, Ulya R. Karpuzcu

Processing in memory (PiM) represents a promising computing paradigm to enhance performance of numerous data-intensive applications. Variants performing computing directly in emerging nonvolatile memories can deliver very high energy efficiency. PiM architectures directly inherit the vulnerabilities of the underlying memory substrates, but they also are subject to errors due to the computation in place. Numerous well-established error correcting codes (ECC) for memory exist, and are also considered in the PiM context, however, they typically ignore errors that occur throughout computation. In this paper we revisit the error correction design space for nonvolatile PiM, considering both storage/memory and computation-induced errors, surveying several self-checking and homomorphic approaches. We propose several solutions and analyze their complex performance-area-coverage trade-off, using three representative nonvolatile PiM technologies. All of these solutions guarantee single error correction for both, bulk bitwise computations and ordinary memory/storage errors.

4/30/2024

cs.ET

📊

SpComm3D: A Framework for Enabling Sparse Communication in 3D Sparse Kernels

Nabil Abubaker, Torsten Hoefler

Existing 3D algorithms for distributed-memory sparse kernels suffer from limited scalability due to reliance on bulk sparsity-agnostic communication. While easier to use, sparsity-agnostic communication leads to unnecessary bandwidth and memory consumption. We present SpComm3D, a framework for enabling sparsity-aware communication and minimal memory footprint such that no unnecessary data is communicated or stored in memory. SpComm3D performs sparse communication efficiently with minimal or no communication buffers to further reduce memory consumption. SpComm3D detaches the local computation at each processor from the communication, allowing flexibility in choosing the best accelerated version for computation. We build 3D algorithms with SpComm3D for the two important sparse ML kernels: Sampled Dense-Dense Matrix Multiplication (SDDMM) and Sparse matrix-matrix multiplication (SpMM). Experimental evaluations on up to 1800 processors demonstrate that SpComm3D has superior scalability and outperforms state-of-the-art sparsity-agnostic methods with up to 20x improvement in terms of communication, memory, and runtime of SDDMM and SpMM. The code is available at: https://github.com/nfabubaker/SpComm3D

5/1/2024

cs.DC