WWW: What, When, Where to Compute-in-Memory

arXiv:2312.15896

Published 6/21/2024 by Tanvi Sharma, Mustafa Ali, Indranil Chakraborty, Kaushik Roy

Abstract

Compute-in-memory (CiM) has emerged as a highly energy-efficient solution for performing matrix multiplication during Machine Learning (ML) inference. However, integrating compute in memory poses key questions, such as 1) What type of CiM to use: Given a multitude of CiM design characteristics, determining their suitability from an architecture perspective is needed. 2) When to use CiM: ML inference includes workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial. 3) Where to integrate CiM: Each memory level has different bandwidth and capacity, creating different data reuse opportunities for CiM integration. To answer such questions regarding on-chip CiM integration for accelerating ML workloads, we use an analytical architecture evaluation methodology where we tailor the dataflow mapping. The mapping algorithm aims to achieve the highest weight reuse and reduced data movement for a given CiM prototype and workload. Our experiments show that CiM-integrated memory improves energy efficiency by up to 3.4x and throughput by up to 15.6x compared to a tensor-core-like baseline architecture, with INT-8 precision under iso-area constraints. We believe the proposed work provides insights into what type of CiM to use, and when and where to optimally integrate it in the cache hierarchy for efficient matrix multiplication.

Overview

This paper explores the "what, when, and where" of compute-in-memory (CIM) technology, which aims to improve the efficiency of machine learning inference by performing computations directly within memory.
CIM has the potential to significantly reduce the energy and latency associated with data movement between memory and processors, which is a major bottleneck in conventional computer architectures.
The paper evaluates on-chip CIM integration with an analytical architecture evaluation methodology and a tailored dataflow mapping, and reports that CIM-integrated memory improves energy efficiency by up to 3.4x and throughput by up to 15.6x over a tensor-core-like baseline, at INT-8 precision under iso-area constraints.

Plain English Explanation

Computers today often struggle with the speed and energy efficiency of processing large amounts of data, especially for tasks like machine learning. This is because the data needs to be constantly moved back and forth between the processor and memory, which can be slow and energy-intensive.

Compute-in-memory (CIM) is a new approach that aims to address this by allowing computations to be performed directly within the memory itself, rather than having to shuttle the data back and forth. This can potentially make the process much faster and more efficient.

The paper examines the "what, when, and where" of CIM: what type of CIM design to use, when a workload actually benefits from it, and where in the memory hierarchy it is best integrated. CIM designs differ in their characteristics, machine learning workloads differ in their memory and compute requirements, and each memory level offers a different bandwidth and capacity, so none of these questions has a single answer.

The paper discusses the trade-offs and considerations involved in using CIM, such as the potential improvements in speed and energy efficiency, but also any limitations or challenges that need to be addressed. It explores how CIM could be used in different application areas, like machine learning inference, and how it might be integrated into hybrid computing architectures.

Overall, the paper provides a comprehensive overview of the current state of CIM technology and its potential to transform the way we process data and run computationally intensive applications.

Technical Explanation

The paper begins by highlighting the growing importance of machine learning (ML) inference in a wide range of applications, and the challenges posed by the traditional von Neumann architecture in efficiently executing these workloads. The key bottleneck is the energy and latency associated with moving data between the processing units and memory, known as the "memory wall" problem.
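To make the memory-wall argument concrete, the following is a rough sketch of arithmetic intensity (multiply-accumulate operations per byte moved) for a matrix multiplication. The function name, the one-byte-per-element INT8 assumption, the ideal-reuse assumption, and the example shapes are all illustrative choices, not values taken from the paper.

```python
def gemm_arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 1) -> float:
    """MACs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes (hypothetically) that each operand and the output cross the
    memory boundary exactly once, i.e. perfect on-chip reuse.
    """
    macs = m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return macs / bytes_moved


# Hypothetical shapes: a matrix-vector product (m = 1) versus a square GEMM.
mvm = gemm_arithmetic_intensity(1, 1024, 1024)
square = gemm_arithmetic_intensity(1024, 1024, 1024)
print(f"matrix-vector: {mvm:.2f} MACs/byte, square GEMM: {square:.2f} MACs/byte")
```

Under these assumptions the matrix-vector case performs roughly one MAC per byte moved, while the square case performs several hundred, which is one way to see why workloads differ so much in how strongly data movement limits them.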

To address this, the authors explore the concept of compute-in-memory (CIM), which aims to perform computations directly within the memory subsystem, thereby reducing the need for data movement. The paper provides a thorough review of the various CIM approaches, including SRAM-based in-memory computing as well as emerging memory technologies like resistive RAM (ReRAM) and phase-change memory (PCM).

The authors then delve into the "what, when, and where" of CIM: what type of CIM design suits a given architecture, when a workload's memory and compute requirements make CIM beneficial, and where in the memory hierarchy to integrate it, given that each level offers different bandwidth, capacity, and data reuse opportunities. A key focus is on matrix multiplication for machine learning inference, where the dataflow mapping aims for high weight reuse and reduced data movement for a given CIM design and workload.
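As an illustration of what placing a matrix multiplication onto in-memory arrays involves, here is a minimal sketch of a weight-stationary tiling. It is not the paper's mapping algorithm: the function `map_gemm_to_cim`, the array dimensions, and the number of arrays are hypothetical, and the model only counts tiles, programming passes, weight reuse, and array utilization.

```python
import math


def map_gemm_to_cim(m, k, n, array_rows, array_cols, num_arrays):
    """Weight-stationary tiling of C[m, n] = A[m, k] @ W[k, n].

    Each (hypothetical) CiM array stores one array_rows x array_cols tile of W.
    """
    tiles_k = math.ceil(k / array_rows)
    tiles_n = math.ceil(n / array_cols)
    weight_tiles = tiles_k * tiles_n
    # If W needs more tiles than there are arrays, arrays are rewritten in passes.
    passes = math.ceil(weight_tiles / num_arrays)
    # Once stored, each weight is used once per input row before being replaced.
    reuse_per_weight = m
    # Fraction of allocated array cells that hold real weights (edge tiles pad).
    utilization = (k * n) / (weight_tiles * array_rows * array_cols)
    return {
        "weight_tiles": weight_tiles,
        "passes": passes,
        "reuse_per_weight": reuse_per_weight,
        "utilization": utilization,
    }


# Hypothetical examples: a layer that fills the arrays, and one that does not.
plan = map_gemm_to_cim(64, 512, 512, array_rows=128, array_cols=128, num_arrays=8)
ragged = map_gemm_to_cim(1, 100, 100, array_rows=64, array_cols=64, num_arrays=8)
print(plan)
print(ragged)
```

In this toy model, a larger `m` raises the reuse of every stored weight, while shapes that do not divide evenly into the array size leave cells idle. Both effects bear on when and where in-memory compute pays off.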

The paper also explores the trade-offs and challenges associated with CIM, such as the limited precision and programming complexity, and highlights areas for further research and development to overcome these limitations.
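The limited-precision trade-off can be sketched with a toy model of an in-memory dot product whose result is read out at a fixed number of bits. This is an assumption-laden illustration, not a circuit model from the paper: the helper names, the uniform readout quantizer, and the worst-case full-scale range are all hypothetical.

```python
def quantize_int8(values, scale):
    """Round to the nearest integer and clamp to the signed 8-bit range."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dot_with_readout(weights_q, inputs_q, readout_bits):
    """Exact integer dot product, then a uniform readout with 2**readout_bits levels.

    The full-scale range is the worst-case magnitude for this vector length
    (a hypothetical, conservative choice).
    """
    exact = sum(w * x for w, x in zip(weights_q, inputs_q))
    full_scale = 128 * 128 * len(weights_q)
    step = (2 * full_scale) / (2 ** readout_bits)
    readout = round(exact / step) * step
    return exact, readout


# Hypothetical values chosen so the arithmetic is easy to follow by hand.
weights_q = quantize_int8([5.0, -10.0, 15.0, 20.0], scale=0.5)  # [10, -20, 30, 40]
inputs_q = [1, 2, 3, 4]
exact, coarse = dot_with_readout(weights_q, inputs_q, readout_bits=8)
_, fine = dot_with_readout(weights_q, inputs_q, readout_bits=16)
print(exact, coarse, fine)
```

With an 8-bit readout over the worst-case range, the small result here falls below half a quantization step and is read as zero, while a 16-bit readout recovers it exactly. Real designs choose ranges and readout precision far more carefully, but the sketch shows why precision is a first-order design parameter.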

Critical Analysis

The paper provides a comprehensive and well-structured overview of the current state of compute-in-memory (CIM) technology, covering a wide range of approaches and highlighting the key considerations for its deployment.

One of the strengths of the paper is its balanced perspective, acknowledging both the potential benefits and the limitations of CIM. The authors do not present CIM as a panacea, but rather highlight the specific application domains and workloads where it can be most impactful, such as machine learning inference.

However, the paper could have delved deeper into the practical challenges and trade-offs involved in integrating CIM into real-world systems. For example, the authors could have discussed the challenges of managing heat dissipation, ensuring data integrity, and overcoming the programming complexity associated with CIM architectures.

Additionally, the paper could have explored the potential impact of CIM on the broader computer architecture landscape, and how it might influence the design of future processors, memory subsystems, and system-level optimizations.

Overall, the paper provides a solid foundation for understanding the current state of CIM and its potential applications, but more work is needed to address the practical implementation challenges and fully realize the benefits of this promising technology.

Conclusion

The paper presents a comprehensive overview of compute-in-memory (CIM) technology, exploring the "what, when, and where" of this emerging approach to improving the efficiency of machine learning inference and other data-intensive workloads.

By performing computations directly within the memory subsystem, CIM has the potential to significantly reduce the energy and latency associated with data movement, a major bottleneck in traditional computer architectures. The paper covers the various CIM technologies, including SRAM-based in-memory computing and emerging memory technologies like ReRAM and PCM, and discusses the trade-offs and considerations for deploying CIM in different application domains.

The authors' balanced perspective, acknowledging both the benefits and limitations of CIM, provides a realistic assessment of the current state of the technology and the work that still needs to be done to address practical implementation challenges. As the computing industry continues to grapple with the growing demands of data-intensive applications, CIM may emerge as a key part of the solution, and this paper serves as a valuable resource for understanding its potential and the path forward.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference

Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura

Large language models (LLMs) have recently transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations. This development necessitates speed, efficiency, and accessibility in LLM inference as the computational and memory requirements of these systems grow exponentially. Meanwhile, advancements in computing and memory capabilities are lagging behind, exacerbated by the discontinuation of Moore's law. With LLMs exceeding the capacity of single GPUs, they require complex, expert-level configurations for parallel processing. Memory accesses become significantly more expensive than computation, posing a challenge for efficient scaling, known as the memory wall. Here, compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by directly performing analog computations in memory, potentially reducing latency and power consumption. By closely integrating memory and compute elements, CIM eliminates the von Neumann bottleneck, reducing data movement and improving energy efficiency. This survey paper provides an overview and analysis of transformer-based models, reviewing various CIM architectures and exploring how they can address the imminent challenges of modern AI computing systems. We discuss transformer-related operators and their hardware acceleration schemes and highlight challenges, trends, and insights in corresponding CIM designs.

6/13/2024

cs.AR cs.LG

Analog or Digital In-memory Computing? Benchmarking through Quantitative Modeling

Jiacong Sun, Pouya Houshmand, Marian Verhelst

In-Memory Computing (IMC) has emerged as a promising paradigm for energy-efficient, throughput-efficient and area-efficient machine learning at the edge. However, the differences in hardware architectures, array dimensions, and fabrication technologies among published IMC realizations have made it difficult to grasp their relative strengths. Moreover, previous studies have primarily focused on exploring and benchmarking the peak performance of a single IMC macro rather than full system performance on real workloads. This paper aims to address the lack of a quantitative comparison of Analog In-Memory Computing (AIMC) and Digital In-Memory Computing (DIMC) processor architectures. We propose an analytical IMC performance model that is validated against published implementations and integrated into a system-level exploration framework for comprehensive performance assessments on different workloads with varying IMC configurations. Our experiments show that while DIMC generally has higher computational density than AIMC, AIMC with large macro sizes may have better energy efficiency than DIMC on convolutional-layers and pointwise-layers, which can exploit high spatial unrolling. On the other hand, DIMC with small macro size outperforms AIMC on depthwise-layers, which feature limited spatial unrolling opportunities inside a macro.

5/27/2024

eess.SP cs.AR eess.IV

A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface

Guodong Yin, Mufeng Zhou, Yiming Chen, Wenjun Tang, Zekun Yang, Mingyen Lee, Xirui Du, Jinshan Yue, Jiaxin Liu, Huazhong Yang, Yongpan Liu, Xueqing Li

Performing data-intensive tasks in the von Neumann architecture is challenging to achieve both high performance and power efficiency due to the memory wall bottleneck. Computing-in-memory (CiM) is a promising mitigation approach by enabling parallel in-situ multiply-accumulate (MAC) operations within the memory with support from the peripheral interface and datapath. SRAM-based charge-domain CiM (CD-CiM) has shown its potential of enhanced power efficiency and computing accuracy. However, existing SRAM-based CD-CiM faces scaling challenges to meet the throughput requirement of high-performance multi-bit-quantization applications. This paper presents an SRAM-based high-throughput ReLU-optimized CD-CiM macro. It is capable of completing MAC and ReLU of two signed 8b vectors in one CiM cycle with only one A/D conversion. Along with non-linearity compensation for the analog computing and A/D conversion interfaces, this work achieves 51.2GOPS throughput and 10.3TOPS/W energy efficiency, while showing 88.6% accuracy in the CIFAR-10 dataset.

4/3/2024

cs.AR cs.LG

CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators

Songyun Qu, Shixin Zhao, Bing Li, Yintao He, Xuyi Cai, Lei Zhang, Ying Wang

In recent years, various computing-in-memory (CIM) processors have been presented, showing superior performance over traditional architectures. To unleash the potential of various CIM architectures, such as device precision, crossbar size, and crossbar number, it is necessary to develop compilation tools that are fully aware of the CIM architectural details and implementation diversity. However, due to the lack of architectural support in current popular open-source compiling stacks, existing CIM designs either manually deploy networks or build their own compilers, which is time-consuming and labor-intensive. Although some works expose the specific CIM device programming interfaces to compilers, they are often bound to a fixed CIM architecture, lacking the flexibility to support the CIM architectures with different computing granularity. On the other hand, existing compilation works usually consider the scheduling of limited operation types (such as crossbar-bound matrix-vector multiplication). Unlike conventional processors, CIM accelerators are featured by their diverse architecture, circuit, and device, which cannot be simply abstracted by a single level if we seek to fully explore the advantages brought by CIM. Therefore, we propose CIM-MLC, a universal multi-level compilation framework for general CIM architectures. We first establish a general hardware abstraction for CIM architectures and computing modes to represent various CIM accelerators. Based on the proposed abstraction, CIM-MLC can compile tasks onto a wide range of CIM accelerators having different devices, architectures, and programming interfaces. More importantly, compared with existing compilation work, CIM-MLC can explore the mapping and scheduling strategies across multiple architectural tiers, which form a tractable yet effective design space, to achieve better scheduling and instruction generation results.

5/9/2024

cs.AR cs.CL