An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing

Read original: arXiv:2409.04940 - Published 9/10/2024 by Ashkan Moradifirouzabadi, Divya Sri Dodla, Mingu Kang

Overview

Presents an analog and digital hybrid attention accelerator for transformers that uses charge-based in-memory computing
Focuses on efficient attention computation and token pruning to reduce compute and memory requirements
Leverages both analog and digital components to achieve high performance and energy efficiency

Plain English Explanation

The research paper describes a new hardware design for accelerating the attention mechanism in transformer models, which is a key component of many modern AI systems. The proposed approach combines analog and digital processing to achieve high performance and energy efficiency.

The core idea is to use a charge-based in-memory computing technique to score tokens in the analog domain, cheaply and at low power, and to prune the tokens that score low (about 75% on average, according to the abstract). A digital processor then performs the precise attention computation only for the remaining ~25% of tokens, which keeps accuracy from degrading while cutting the overall computational workload.

Measured on a 65nm CMOS chip, the design reaches a peak energy efficiency of 14.8 TOPS/W in the analog core and 1.65 TOPS/W for the full system-on-chip, which makes it a promising approach for running transformer attention on resource-constrained edge devices.

Technical Explanation

The paper presents an analog and digital hybrid attention accelerator for transformers that leverages charge-based in-memory computing to efficiently compute the attention mechanism. The key elements of the proposed design include:

Analog CIM Core: An analog computing-in-memory (CIM) core uses charge-based computation to score tokens at ultra-low power and delay, and prunes the low-score ones at runtime (about 75% on average).
Digital Processor: A digital processor performs precise attention computations only for the roughly 25% of tokens that the analog core leaves unpruned, which prevents accuracy degradation.
Token Pruning: Pruning ties the two parts together. Because tokens deemed less important are skipped before the precise stage, the digital processor handles only about a quarter of the tokens.
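
The two-stage flow described above can be sketched in software. This is a minimal illustration under stated assumptions, not the chip's circuit or the authors' algorithm: the names quantize, pruned_attention, and dense_attention are hypothetical, 4-bit uniform quantization stands in for the low-precision analog scoring, and a fixed top-k rule with keep_ratio=0.25 stands in for the runtime pruning, which the abstract describes only as removing ~75% of low-score tokens on average.

```python
import numpy as np

def quantize(x, bits=4):
    """Uniform symmetric quantization (assumed stand-in for analog precision)."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    if scale == 0:
        return x.copy()
    return np.round(x / scale) * scale

def dense_attention(q, K, V):
    """Reference: full-precision softmax attention for one query vector."""
    scores = (K @ q) / np.sqrt(K.shape[1])
    scores -= scores.max()          # subtract max for numerical stability
    w = np.exp(scores)
    w /= w.sum()
    return w @ V

def pruned_attention(q, K, V, keep_ratio=0.25, bits=4):
    """Coarse scoring picks tokens; precise attention runs on the kept ones."""
    n = K.shape[0]
    # Stage 1: cheap low-precision scores, used only to rank tokens.
    coarse = quantize(K, bits) @ quantize(q, bits)
    k = max(1, int(np.ceil(keep_ratio * n)))
    keep = np.argpartition(coarse, n - k)[n - k:]   # indices of the k largest
    # Stage 2: full-precision attention restricted to the kept tokens.
    return dense_attention(q, K[keep], V[keep]), np.sort(keep)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal(16)
    K = rng.standard_normal((64, 16))
    V = rng.standard_normal((64, 16))
    out, kept = pruned_attention(q, K, V)
    print("tokens kept:", len(kept), "of", K.shape[0])
    print("max abs deviation from dense attention:",
          np.max(np.abs(out - dense_attention(q, K, V))))
```

Only the ranking of the coarse scores matters in stage 1, so the scoring can tolerate low precision. How closely the pruned output tracks dense attention depends on how concentrated the attention weights are, which is one reason a pruning rate tuned on one model may not carry over to another.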

The researchers evaluate their design using various transformer-based models and benchmark tasks, demonstrating significant improvements in terms of both performance and energy efficiency compared to prior attention acceleration and hardware-aware attention approaches.

Critical Analysis

The paper presents a well-designed and thoroughly evaluated hybrid attention accelerator that leverages both analog and digital processing to achieve high performance and energy efficiency. The use of charge-based in-memory computing for the attention computation is a promising approach that aligns with the growing interest in analog and mixed-signal hardware for AI acceleration.

One potential limitation of the proposed design is its reliance on the token pruning mechanism, which may not be applicable to all transformer-based models or tasks. The effectiveness of the pruning approach could be sensitive to the specific model architecture and input data, and further research may be needed to understand its broader applicability.

Additionally, although the abstract reports peak energy and area efficiency, it gives little detail on implementation complexity or on how area and power divide between the analog and digital parts, which would be important considerations for real-world deployment. Further analysis of the hardware overhead and potential scalability challenges would be valuable.

Overall, the research presents a compelling approach that combines analog and digital processing to achieve efficient attention acceleration, and the results suggest that this hybrid design is a promising direction for future work in transformer hardware acceleration.

Conclusion

The paper introduces an analog and digital hybrid attention accelerator for transformers that leverages charge-based in-memory computing to efficiently perform the attention mechanism. By combining analog and digital components, the design achieves significant performance and energy improvements compared to prior attention acceleration techniques.

The key strengths of the proposed approach are that it uses efficient analog processing to score and prune tokens, while keeping the precision, flexibility, and programmability of digital logic for the tokens that remain. Pruning about 75% of tokens before the precise stage is what drives the overall efficiency of the system.

The research represents an important step forward in developing hardware-accelerated solutions for deploying large transformer models on resource-constrained edge devices. The hybrid design showcases the potential of combining analog and digital processing to address the growing computational demands of modern AI systems.

Related Papers

An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing

Ashkan Moradifirouzabadi, Divya Sri Dodla, Mingu Kang

The attention mechanism is a key computing kernel of Transformers, calculating pairwise correlations across the entire input sequence. The computing complexity and frequent memory access in computing self-attention put a huge burden on the system especially when the sequence length increases. This paper presents an analog and digital hybrid processor to accelerate the attention mechanism for transformers in 65nm CMOS technology. We propose an analog computing-in-memory (CIM) core, which prunes ~75% of low-score tokens on average during runtime at ultra-low power and delay. Additionally, a digital processor performs precise computations only for ~25% unpruned tokens selected by the analog CIM core, preventing accuracy degradation. Measured results show peak energy efficiency of 14.8 and 1.65 TOPS/W, and peak area efficiency of 976.6 and 79.4 GOPS/mm$^2$ in the analog core and the system-on-chip (SoC), respectively.

9/10/2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0$\times$ with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6$\times$ lower numerical error than a baseline FP8 attention.

7/16/2024

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Ruhle, Saravan Rajmohan

Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has increased steadily reaching billions of parameters. These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators, such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in terms of the total context length, i.e., prompt and output tokens. Thus, several optimizations such as key-value tensor caching and FlashAttention computation have been proposed to deliver the low latency demands of applications relying on such large models. However, these techniques do not cater to the computationally distinct nature of different phases during inference. To that end, we propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase (decode-phase) of decoder-only transformer models. LeanAttention enables scaling the attention mechanism implementation for the challenging case of long context lengths by re-designing the execution flow for the decode-phase. We identify that the associative property of online softmax can be treated as a reduction operation thus allowing us to parallelize the attention computation over these large context lengths. We extend the stream-K style reduction of tiled calculation to self-attention to enable parallel computation resulting in an average of 2.6x attention execution speedup over FlashAttention-2 and up to 8.33x speedup for 512k context lengths.

5/20/2024

ARTEMIS: A Mixed Analog-Stochastic In-DRAM Accelerator for Transformer Neural Networks

Salma Afifi, Ishan Thakkar, Sudeep Pasricha

Transformers have emerged as a powerful tool for natural language processing (NLP) and computer vision. Through the attention mechanism, these models have exhibited remarkable performance gains when compared to conventional approaches like recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Nevertheless, transformers typically demand substantial execution time due to their extensive computations and large memory footprint. Processing in-memory (PIM) and near-memory computing (NMC) are promising solutions to accelerating transformers as they offer high compute parallelism and memory bandwidth. However, designing PIM/NMC architectures to support the complex operations and massive amounts of data that need to be moved between layers in transformer neural networks remains a challenge. We propose ARTEMIS, a mixed analog-stochastic in-DRAM accelerator for transformer models. Through employing minimal changes to the conventional DRAM arrays, ARTEMIS efficiently alleviates the costs associated with transformer model execution by supporting stochastic computing for multiplications and temporal analog accumulations using a novel in-DRAM metal-on-metal capacitor. Our analysis indicates that ARTEMIS exhibits at least 3.0x speedup, 1.8x lower energy, and 1.9x better energy efficiency compared to GPU, TPU, CPU, and state-of-the-art PIM transformer hardware accelerators.

7/18/2024