PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation

Read original: arXiv:2408.16246 - Published 8/30/2024 by Wenlun Zhang, Shimpei Ando, Yung-Chin Chen, Satomi Miyagi, Shinya Takamaeda-Yamazaki, Kentaro Yoshioka

Overview

PACiM: A novel hybrid compute-in-memory architecture that leverages sparsity and probabilistic approximation for efficient deep learning inference
Combines analog and digital compute-in-memory to achieve high performance and energy efficiency
Exploits sparse activations and weights to reduce the number of computations and data transfers

Plain English Explanation

PACiM is a new type of computer chip design that aims to run deep learning models more efficiently. The key idea is to take advantage of the fact that many of the values (activations and weights) in deep neural networks are zero or close to zero.

Rather than performing computations on all of these near-zero values, PACiM uses a hybrid analog-digital approach to focus computation only on the important, non-zero values. The analog part of the chip performs approximate computations in memory, while the digital part handles the precise arithmetic.

This sparsity-centric design reduces the overall number of computations and data transfers required, leading to significant improvements in performance and energy efficiency compared to conventional digital processors. PACiM uses probabilistic approximation techniques to further optimize the computations, trading off some precision for even greater efficiency gains.
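To make the probabilistic idea concrete, here is a minimal Python sketch that estimates the dot product of two 0/1 vectors from their popcounts alone, so a vector operation becomes a scalar one. It assumes the 1s in the two vectors are statistically independent. The function names, the vector length, and the densities 0.2 and 0.5 are illustrative choices, not values from the paper.

```python
import random

def exact_binary_dot(a_bits, w_bits):
    """Exact dot product of two 0/1 vectors: counts positions where both are 1."""
    return sum(a & w for a, w in zip(a_bits, w_bits))

def approx_binary_dot(a_bits, w_bits):
    """Estimate the same dot product from the two popcounts alone.

    Assumption: the 1s in each vector are placed independently of the other,
    so the expected overlap is N * (S_a / N) * (S_w / N) = S_a * S_w / N.
    """
    n = len(a_bits)
    return sum(a_bits) * sum(w_bits) / n

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1024
a_bits = [1 if rng.random() < 0.2 else 0 for _ in range(n)]  # sparse "activation" bits
w_bits = [1 if rng.random() < 0.5 else 0 for _ in range(n)]  # denser "weight" bits

exact = exact_binary_dot(a_bits, w_bits)
approx = approx_binary_dot(a_bits, w_bits)
print(exact, round(approx, 1))
```

The estimate needs only two counts per vector pair rather than every bit, which is one plausible reading of how sparsity information could replace part of the data transfer.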

By optimizing the hardware for the unique characteristics of deep learning workloads, PACiM demonstrates the potential for specialized compute-in-memory architectures to advance the state-of-the-art in AI acceleration.

Technical Explanation

The key technical elements of PACiM include:

Hybrid Analog-Digital Design: PACiM combines analog compute-in-memory for approximate computations with a digital processor for precise arithmetic. This hybrid approach allows PACiM to efficiently leverage the strengths of both analog and digital computing.
Sparsity-Centric Architecture: PACiM is designed to take advantage of the sparse activations and weights common in deep neural networks. It avoids performing unnecessary computations on near-zero values, reducing the overall compute and data movement requirements.
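Probabilistic Approximation: PACiM's probabilistic approximate computation (PAC) method uses statistical techniques to approximate multiply-and-accumulate (MAC) operations, simplifying vector MAC computations into scalar calculations. The paper's abstract reports a 4X reduction in approximation error compared to existing approximate approaches.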
Memory-Centric Design: PACiM's architecture is centered around the memory subsystem, placing the analog compute units directly in the memory arrays. This minimizes data movement and enables efficient compute-in-memory operations.
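The elements above can be sketched in Python as a bit-serial 8b/8b MAC in which high-order bit-plane pairs are computed exactly and low-order pairs are estimated from popcounts. This is a sketch under stated assumptions, not the authors' implementation: `hybrid_mac`, `exact_min_order`, the unsigned operands, and the rule that splits exact from approximate pairs are all hypothetical.

```python
import random

BITS = 8  # 8b activations and 8b weights (unsigned here, an assumption for simplicity)

def bit_plane(values, k):
    """Return bit k (0 = LSB) of every value as a 0/1 list."""
    return [(v >> k) & 1 for v in values]

def hybrid_mac(acts, weights, exact_min_order):
    """Bit-serial MAC mixing exact and probabilistic bit-plane products.

    A bit-plane pair (i, j) contributes 2**(i + j) * dot(act_bit_i, w_bit_j).
    Pairs with i + j >= exact_min_order are computed exactly; the rest are
    estimated from popcounts as S_a * S_w / N (independence assumption).
    Returns (result, number_of_exact_pairs).
    """
    n = len(acts)
    a_planes = [bit_plane(acts, i) for i in range(BITS)]
    w_planes = [bit_plane(weights, j) for j in range(BITS)]
    a_counts = [sum(p) for p in a_planes]
    w_counts = [sum(p) for p in w_planes]

    total = 0.0
    exact_pairs = 0
    for i in range(BITS):
        for j in range(BITS):
            if i + j >= exact_min_order:
                # Exact path: one full bit-serial cycle over the vector.
                partial = sum(a & w for a, w in zip(a_planes[i], w_planes[j]))
                exact_pairs += 1
            else:
                # Approximate path: a single scalar multiply of two counts.
                partial = a_counts[i] * w_counts[j] / n
            total += partial * (1 << (i + j))
    return total, exact_pairs

rng = random.Random(1)
acts = [rng.randrange(256) for _ in range(512)]
weights = [rng.randrange(256) for _ in range(512)]
reference = sum(a * w for a, w in zip(acts, weights))

full, full_pairs = hybrid_mac(acts, weights, exact_min_order=0)    # everything exact
mixed, mixed_pairs = hybrid_mac(acts, weights, exact_min_order=8)  # low orders estimated
print(full == reference, full_pairs, mixed_pairs)
print(abs(mixed - reference) / reference)
```

With `exact_min_order=0` the function reproduces the ordinary integer MAC. Raising the threshold swaps full vector cycles for scalar multiplies on the bit-plane pairs that carry the least weight, which is where an approximation error costs the least.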

The authors evaluate PACiM with a ResNet-18 model on the CIFAR-10, CIFAR-100, and ImageNet benchmarks, reporting accuracies of 93.85%, 72.36%, and 66.02% respectively. According to the paper's abstract, the architecture cuts bit-serial cycles by 81%, reduces memory accesses by 50%, and reaches a peak 8b/8b efficiency of 14.63 TOPS/W in 65 nm CMOS.

Critical Analysis

The authors acknowledge several limitations and areas for future research:

Precision vs. Efficiency Tradeoff: The probabilistic approximation techniques used in PACiM may not be suitable for all types of deep learning models or applications that require high precision. Further research is needed to understand the impact of this tradeoff.
Hardware Complexity: The hybrid analog-digital design of PACiM introduces additional complexity in the chip fabrication and control logic. The feasibility and scalability of this approach require further investigation.
Robustness and Reliability: Analog computing can be more susceptible to variations in manufacturing and operating conditions. Ensuring the reliability and robustness of PACiM's analog components is an important consideration.
Generalization to Other Workloads: While PACiM is tailored for deep learning inference, its applicability to other types of compute-intensive workloads is not explored in the paper. Evaluating the architecture's versatility would be valuable.
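The precision tradeoff in the first point can be illustrated with a small Python sketch. It assumes a popcount-based estimator of a binary dot product, S_a * S_w / N, which is only accurate when activation and weight bits are statistically independent. The estimator form and the function names are illustrative assumptions, and the two inputs are deliberately extreme cases rather than realistic network data.

```python
def exact_dot(a_bits, w_bits):
    """Exact dot product of two 0/1 vectors."""
    return sum(a & w for a, w in zip(a_bits, w_bits))

def popcount_estimate(a_bits, w_bits):
    """Estimate dot(a_bits, w_bits) as S_a * S_w / N, assuming independence."""
    return sum(a_bits) * sum(w_bits) / len(a_bits)

a_bits = [1] * 256 + [0] * 768     # 25% of the 1024 activation bits are 1
aligned = list(a_bits)             # weight bits are 1 exactly where activation bits are
opposed = [1 - a for a in a_bits]  # weight bits are 1 only where activation bits are 0

for name, w_bits in (("aligned", aligned), ("opposed", opposed)):
    print(name, exact_dot(a_bits, w_bits), popcount_estimate(a_bits, w_bits))
# aligned 256 64.0
# opposed 0 192.0
```

When the bits are perfectly correlated the estimate is four times too low, and when they are perfectly anti-correlated it reports a large value for a true result of zero. Real layers sit between these extremes, so the size of the error depends on the data.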

Overall, PACiM presents a promising direction for advancing the state of AI acceleration through specialized hardware design. However, the practical implementation challenges and the broader applicability of the proposed approach warrant further research and development.

Conclusion

PACiM is a novel hybrid compute-in-memory architecture that leverages sparsity and probabilistic approximation to achieve high performance and energy efficiency for deep learning inference. By combining analog and digital computing and optimizing the hardware for the unique characteristics of deep neural networks, PACiM demonstrates the potential of specialized architectures to accelerate AI workloads.

While the paper highlights several limitations and areas for future work, the key insights and design principles of PACiM offer a compelling roadmap for the continued evolution of AI hardware. As the demand for efficient AI inference continues to grow, research like this can help drive the development of increasingly powerful and energy-efficient computing systems.

Related Papers

PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation

Wenlun Zhang, Shimpei Ando, Yung-Chin Chen, Satomi Miyagi, Shinya Takamaeda-Yamazaki, Kentaro Yoshioka

Approximate computing emerges as a promising approach to enhance the efficiency of compute-in-memory (CiM) systems in deep neural network processing. However, traditional approximate techniques often significantly trade off accuracy for power efficiency, and fail to reduce data transfer between main memory and CiM banks, which dominates power consumption. This paper introduces a novel probabilistic approximate computation (PAC) method that leverages statistical techniques to approximate multiply-and-accumulation (MAC) operations, reducing approximation error by 4X compared to existing approaches. PAC enables efficient sparsity-based computation in CiM systems by simplifying complex MAC vector computations into scalar calculations. Moreover, PAC enables sparsity encoding and eliminates the LSB activations transmission, significantly reducing data reads and writes. This sets PAC apart from traditional approximate computing techniques, minimizing not only computation power but also memory accesses by 50%, thereby boosting system-level efficiency. We developed PACiM, a sparsity-centric architecture that fully exploits sparsity to reduce bit-serial cycles by 81% and achieves a peak 8b/8b efficiency of 14.63 TOPS/W in 65 nm CMOS while maintaining high accuracy of 93.85/72.36/66.02% on CIFAR-10/CIFAR-100/ImageNet benchmarks using a ResNet-18 model, demonstrating the effectiveness of our PAC methodology.

8/30/2024

A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface

Guodong Yin, Mufeng Zhou, Yiming Chen, Wenjun Tang, Zekun Yang, Mingyen Lee, Xirui Du, Jinshan Yue, Jiaxin Liu, Huazhong Yang, Yongpan Liu, Xueqing Li

Performing data-intensive tasks in the von Neumann architecture is challenging to achieve both high performance and power efficiency due to the memory wall bottleneck. Computing-in-memory (CiM) is a promising mitigation approach by enabling parallel in-situ multiply-accumulate (MAC) operations within the memory with support from the peripheral interface and datapath. SRAM-based charge-domain CiM (CD-CiM) has shown its potential of enhanced power efficiency and computing accuracy. However, existing SRAM-based CD-CiM faces scaling challenges to meet the throughput requirement of high-performance multi-bit-quantization applications. This paper presents an SRAM-based high-throughput ReLU-optimized CD-CiM macro. It is capable of completing MAC and ReLU of two signed 8b vectors in one CiM cycle with only one A/D conversion. Along with non-linearity compensation for the analog computing and A/D conversion interfaces, this work achieves 51.2GOPS throughput and 10.3TOPS/W energy efficiency, while showing 88.6% accuracy in the CIFAR-10 dataset.

4/3/2024

PICO-RAM: A PVT-Insensitive Analog Compute-In-Memory SRAM Macro with In-Situ Multi-Bit Charge Computing and 6T Thin-Cell-Compatible Layout

Zhiyu Chen, Ziyuan Wen, Weier Wan, Akhil Reddy Pakala, Yiwei Zou, Wei-Chen Wei, Zengyi Li, Yubei Chen, Kaiyuan Yang

Analog compute-in-memory (CIM) in static random-access memory (SRAM) is promising for accelerating deep learning inference by circumventing the memory wall and exploiting ultra-efficient analog low-precision arithmetic. Latest analog CIM designs attempt bit-parallel schemes for multi-bit analog Matrix-Vector Multiplication (MVM), aiming at higher energy efficiency, throughput, and training simplicity and robustness over conventional bit-serial methods that digitally shift-and-add multiple partial analog computing results. However, bit-parallel operations require more complex analog computations and become more sensitive to well-known analog CIM challenges, including large cell areas, inefficient and inaccurate multi-bit analog operations, and vulnerability to PVT variations. This paper presents PICO-RAM, a PVT-insensitive and compact CIM SRAM macro with charge-domain bit-parallel computation. It adopts a multi-bit thin-cell Multiply-Accumulate (MAC) unit that shares the same transistor layout as the most compact 6T SRAM cell. All analog computing modules, including digital-to-analog converters (DACs), MAC units, analog shift-and-add, and analog-to-digital converters (ADCs) reuse one set of local capacitors inside the array, performing in-situ computation to save area and enhance accuracy. A compact 8.5-bit dual-threshold time-domain ADC power gates the main path most of the time, leading to a significant energy reduction. Our 65-nm prototype achieves the highest weight storage density of 559 Kb/mm$^2$ and exceptional robustness to temperature and voltage variations (-40 to 105 $^{\circ}$C and 0.65 to 1.2 V) among SRAM-based analog CIM designs.

7/19/2024

WWW: What, When, Where to Compute-in-Memory

Tanvi Sharma, Mustafa Ali, Indranil Chakraborty, Kaushik Roy

Compute-in-memory (CiM) has emerged as a highly energy efficient solution for performing matrix multiplication during Machine Learning (ML) inference. However, integrating compute in memory poses key questions, such as 1) What type of CiM to use: Given a multitude of CiM design characteristics, determining their suitability from architecture perspective is needed. 2) When to use CiM: ML inference includes workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial. 3) Where to integrate CiM: Each memory level has different bandwidth and capacity, creating different data reuse opportunities for CiM integration. To answer such questions regarding on-chip CiM integration for accelerating ML workloads, we use an analytical architecture evaluation methodology where we tailor the dataflow mapping. The mapping algorithm aims to achieve highest weight reuse and reduced data movements for a given CiM prototype and workload. Our experiments show that CiM integrated memory improves energy efficiency by up to 3.4x and throughput by up to 15.6x compared to tensor-core-like baseline architecture, with INT-8 precision under iso-area constraints. We believe the proposed work provides insights into what type of CiM to use, and when and where to optimally integrate it in the cache hierarchy for efficient matrix multiplication.

6/21/2024