A Transverse-Read-assisted Valid-Bit Collection to Accelerate Stochastic Computing MAC for Energy-Efficient in-RTM DNNs

Read original: arXiv:2407.07476 - Published 7/23/2024 by Jihe Wang, Zhiying Zhang, Xingwu Dong, Danghui Wang

Overview

Racetrack Memory (RTM) technology
Stochastic computing for energy-efficient Deep Neural Networks (DNNs)
Transverse read to accelerate Multiply-Accumulate (MAC) operations in stochastic computing

Plain English Explanation

This research explores an approach to improving the energy efficiency of deep neural networks by combining Racetrack Memory (RTM) technology with stochastic computing techniques.

The key idea is to use a "transverse read" mechanism, which senses a whole segment of adjacent bits on an RTM nanowire in one step instead of shifting the bits past a read port one at a time. This gives faster access to the stochastic bitstreams and accelerates the Multiply-Accumulate (MAC) operations that are critical for neural network inference, while also reducing overall energy consumption.

The authors demonstrate that this transverse-read-assisted approach can provide significant performance and energy benefits compared to traditional approaches, making it a promising solution for deploying energy-efficient deep learning models, particularly in RTM-based systems.

Technical Explanation

The paper describes a novel architecture that combines Racetrack Memory (RTM) technology with a transverse-read-assisted stochastic computing technique to accelerate the Multiply-Accumulate (MAC) operations in deep neural networks.

In this approach, the authors leverage the high parallelism and density of RTM to store the weights and activations of the neural network. They then use a "transverse read" mechanism to quickly determine the validity of the stochastic bits during the MAC computations, which helps to reduce the overall latency and energy consumption.
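To make the stochastic-computing side concrete, here is a minimal Python sketch of a unipolar stochastic MAC. It illustrates the general technique rather than the paper's design: the function names (to_stream, sc_multiply, sc_mac), the stream length, and the use of a software random generator are all assumptions made for this example. A value in [0, 1] is encoded as the fraction of 1s in a bitstream, two streams are multiplied with a bitwise AND, and accumulation reduces to counting 1s.

```python
import random


def to_stream(p, length, rng):
    """Encode a value p in [0, 1] as a unipolar stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def sc_multiply(a_bits, b_bits):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def sc_mac(weights, activations, length=4096, seed=0):
    """Estimate sum(w * x) by counting the 1s in every product stream.

    The stream length and seed are hypothetical defaults for this sketch.
    """
    rng = random.Random(seed)
    ones = 0
    for w, x in zip(weights, activations):
        product = sc_multiply(to_stream(w, length, rng),
                              to_stream(x, length, rng))
        ones += sum(product)  # the one-counting (accumulate) step
    return ones / length


weights = [0.5, 0.25, 0.75]
activations = [0.5, 0.8, 0.4]

estimate = sc_mac(weights, activations)
exact = sum(w * x for w, x in zip(weights, activations))
print(f"stochastic estimate: {estimate:.3f}, exact: {exact:.3f}")
```

The exact dot product here is 0.75, and the estimate approaches it as the streams get longer. The one-counting step inside sc_mac is the part that the architecture described here aims to speed up.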

The proposed architecture includes several key components:

An RTM-based memory system to store the network weights and activations
A stochastic computing unit that performs the MAC operations using the stored data
A transverse read circuit that quickly identifies the valid bits in the stochastic data stream, accelerating the MAC computations
A control logic unit that orchestrates the data movement and computation within the system
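The benefit of the transverse read circuit can be illustrated with a toy access-count model. This is a sketch under stated assumptions, not the authors' circuit: the track is modeled as a Python list of bits, a transverse read is assumed to return the number of 1s in a segment in a single access, and the segment length of 4 is a hypothetical value chosen for the example.

```python
def shift_and_access_count(track):
    """Count 1s bit by bit: one access per domain on the track."""
    ones, accesses = 0, 0
    for bit in track:
        ones += bit
        accesses += 1
    return ones, accesses


def transverse_read_count(track, segment=4):
    """Count 1s segment by segment.

    Assumption: each segment costs one access, and that access
    returns the number of 1s in the segment (not their positions).
    """
    ones, accesses = 0, 0
    for start in range(0, len(track), segment):
        ones += sum(track[start:start + segment])
        accesses += 1
    return ones, accesses


track = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0]

serial = shift_and_access_count(track)
segmented = transverse_read_count(track)
print("bit-serial   (ones, accesses):", serial)
print("segment-wise (ones, accesses):", segmented)
```

On this 16-bit track both methods find nine 1s, but the bit-serial path needs 16 accesses while the segment-wise path needs 4. Real hardware costs depend on shift latency and sensing limits that this model ignores.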

The authors evaluate the performance and energy efficiency of their approach through simulations and compare it to alternative SRAM-based and RRAM-based DNN accelerators. According to the abstract, computation is 2.88x faster on LeNet-5 and 4.40x faster on VGG-19 than CORUSCANT, with accompanying gains in power efficiency and stochastic accuracy.

Critical Analysis

The paper presents a well-designed and thoroughly evaluated solution for improving the energy efficiency of deep neural networks using RTM technology and stochastic computing. The authors have clearly identified the key challenges and have proposed a novel architecture to address them.

One potential limitation of the approach is the reliance on the transverse read mechanism, which may introduce additional hardware complexity and cost. Additionally, the performance and energy benefits of the proposed solution may be dependent on the specific neural network architecture and the workload characteristics.

Further research could explore the applicability of this approach to a wider range of neural network models and applications, as well as investigate the tradeoffs between the hardware complexity, performance, and energy efficiency in more depth. It would also be valuable to explore the integration of this approach with other emerging memory technologies to further improve the overall system efficiency.

Conclusion

This research presents a promising solution for improving the energy efficiency of deep neural networks by combining Racetrack Memory (RTM) technology with a transverse-read-assisted stochastic computing technique. The authors demonstrate significant performance and energy benefits compared to traditional approaches, making this a valuable contribution to the field of energy-efficient deep learning accelerators.

The proposed architecture leverages the high parallelism and density of RTM to store the weights and activations of the neural network, while the transverse read mechanism accelerates the critical Multiply-Accumulate (MAC) operations. This approach has the potential to enable the deployment of energy-efficient deep learning models in a wide range of applications, from edge devices to large-scale cloud computing infrastructures.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Transverse-Read-assisted Valid-Bit Collection to Accelerate Stochastic Computing MAC for Energy-Efficient in-RTM DNNs

Jihe Wang, Zhiying Zhang, Xingwu Dong, Danghui Wang

It looks attractive to coordinate racetrack-memory(RM) and stochastic-computing (SC) jointly to build an ultra-low power neuron-architecture. However, the above combination has always been questioned in a fatal weakness that the narrow bit-view of the RM-MTJ structure, a.k.a. shift-and-access pattern, cannot physically match the great throughput of direct-stored stochastic sequences. Fortunately, a recently developed Transverse-Read(TR) provides a wider segment-view to RM via detecting the resistance of domain-walls between a couple of MTJs on single nanowire, therefore RM can be enhanced with a faster access to the sequences without any substantial domain-shift. To utilize TR for a power-efficient SC-DNNs, we propose a segment-based compression to leverage one-cycle TR to only read those kernel segments of stochastic sequences, meanwhile, remove a large number of redundant segments for ultra-high storage density. In decompression stage, low-discrepancy stochastic sequences can be quickly reassembled by a select-and-output loop using kernel segments rather than slowly regenerated by costly SNGs. Since TR can provide an ideal in-memory acceleration in one-counting, counter-free SC-MACs are designed and deployed near RMs to form a power-efficient neuron-architecture, in which, the binary results of TR are activated straightforward without sluggish APCs. The results show that under the TR aided RM model, the power efficiency, speed, and stochastic accuracy of Seed-based Fast Stochastic Computing significantly enhance the performance of DNNs. The speed of computation is 2.88x faster in Lenet-5 and 4.40x faster in VGG-19 compared to the CORUSCANT. The integration of TR with RTM is deployed near the memory to create a power-efficient neuron architecture, eliminating the need for slow Accumulative Parallel Counters (APCs) and improving access speed to stochastic sequences.

7/23/2024

StoX-Net: Stochastic Processing of Partial Sums for Efficient In-Memory Computing DNN Accelerators

Ethan G Rogers, Sohan Salahuddin Mugdho, Kshemal Kshemendra Gupte, Cheng Wang

Crossbar-based in-memory computing (IMC) has emerged as a promising platform for hardware acceleration of deep neural networks (DNNs). However, the energy and latency of IMC systems are dominated by the large overhead of the peripheral analog-to-digital converters (ADCs). To address such ADC bottleneck, here we propose to implement stochastic processing of array-level partial sums (PS) for efficient IMC. Leveraging the probabilistic switching of spin-orbit torque magnetic tunnel junctions, the proposed PS processing eliminates the costly ADC, achieving significant improvement in energy and area efficiency. To mitigate accuracy loss, we develop PS-quantization-aware training that enables backward propagation across stochastic PS. Furthermore, a novel scheme with an inhomogeneous sampling length of the stochastic conversion is proposed. When running ResNet20 on the CIFAR-10 dataset, our architecture-to-algorithm co-design demonstrates up to 22x, 30x, and 142x improvement in energy, latency, and area, respectively, compared to IMC with standard ADC. Our optimized design configuration using stochastic PS achieved 666x (111x) improvement in Energy-Delay-Product compared to IMC with full precision ADC (sparse low-bit ADC), while maintaining near-software accuracy at various benchmark classification tasks.

7/18/2024

A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface

Guodong Yin, Mufeng Zhou, Yiming Chen, Wenjun Tang, Zekun Yang, Mingyen Lee, Xirui Du, Jinshan Yue, Jiaxin Liu, Huazhong Yang, Yongpan Liu, Xueqing Li

Performing data-intensive tasks in the von Neumann architecture is challenging to achieve both high performance and power efficiency due to the memory wall bottleneck. Computing-in-memory (CiM) is a promising mitigation approach by enabling parallel in-situ multiply-accumulate (MAC) operations within the memory with support from the peripheral interface and datapath. SRAM-based charge-domain CiM (CD-CiM) has shown its potential of enhanced power efficiency and computing accuracy. However, existing SRAM-based CD-CiM faces scaling challenges to meet the throughput requirement of high-performance multi-bit-quantization applications. This paper presents an SRAM-based high-throughput ReLU-optimized CD-CiM macro. It is capable of completing MAC and ReLU of two signed 8b vectors in one CiM cycle with only one A/D conversion. Along with non-linearity compensation for the analog computing and A/D conversion interfaces, this work achieves 51.2GOPS throughput and 10.3TOPS/W energy efficiency, while showing 88.6% accuracy in the CIFAR-10 dataset.

4/3/2024

Experimental demonstration of magnetic tunnel junction-based computational random-access memory

Yang Lv, Brandon R. Zink, Robert P. Bloom, Husrev Cılasun, Pravin Khanal, Salonik Resch, Zamshed Chowdhury, Ali Habiboglu, Weigang Wang, Sachin S. Sapatnekar, Ulya Karpuzcu, Jian-Ping Wang

Conventional computing paradigm struggles to fulfill the rapidly growing demands from emerging applications, especially those for machine intelligence, because much of the power and energy is consumed by constant data transfers between logic and memory modules. A new paradigm, called computational random-access memory (CRAM) has emerged to address this fundamental limitation. CRAM performs logic operations directly using the memory cells themselves, without having the data ever leave the memory. The energy and performance benefits of CRAM for both conventional and emerging applications have been well established by prior numerical studies. However, there lacks an experimental demonstration and study of CRAM to evaluate its computation accuracy, which is a realistic and application-critical metrics for its technological feasibility and competitiveness. In this work, a CRAM array based on magnetic tunnel junctions (MTJs) is experimentally demonstrated. First, basic memory operations as well as 2-, 3-, and 5-input logic operations are studied. Then, a 1-bit full adder with two different designs is demonstrated. Based on the experimental results, a suite of modeling has been developed to characterize the accuracy of CRAM computation. Scalar addition, multiplication, and matrix multiplication, which are essential building blocks for many conventional and machine intelligence applications, are evaluated and show promising accuracy performance. With the confirmation of MTJ-based CRAM's accuracy, there is a strong case that this technology will have a significant impact on power- and energy-demanding applications of machine intelligence.

5/31/2024