Attention in SRAM on Tenstorrent Grayskull

Read original: arXiv:2407.13885 - Published 7/22/2024 by Moritz Thuning

Overview

This paper presents implementations of the attention-weight computation in the Transformer's self-attention layer that keep their data in the on-chip SRAM of the Tenstorrent Grayskull e150 instead of DRAM.
The author describes a fused kernel that combines matrix multiplication, attention score scaling, and Softmax, alongside a dedicated Softmax kernel and a CPU implementation that serves as a baseline.
The reported results show that Softmax consumes most of the runtime, that the dedicated Softmax kernel is up to 10 times faster than the CPU baseline, and that the Softmax inside the fused kernel is approximately 1.8 times faster than the dedicated kernel.

Plain English Explanation

Attention is a powerful technique used in many state-of-the-art machine learning models. It allows these models to focus on the most relevant parts of their input, leading to improved performance. However, attention mechanisms can also be computationally intensive, which can be a challenge when deploying them on specialized hardware like the Tenstorrent Grayskull e150.

In this paper, the author implements the computation of attention weights so that it runs out of the SRAM (static random-access memory) of the Tenstorrent Grayskull e150. SRAM is a type of fast, on-chip memory that is often used to store intermediate results in machine learning computations. Implementations of self-attention that use SRAM instead of slower off-chip DRAM can achieve significant speedups, and that is the effect the paper sets out to exploit.

The paper compares three implementations: a fused kernel that performs matrix multiplication, score scaling, and Softmax in one pass, a dedicated Softmax kernel that also uses the SRAM, and a CPU implementation that serves as a baseline. Measuring them against each other shows where the time goes (mostly in Softmax) and how much each step toward keeping the work on chip gains.

Technical Explanation

The Tenstorrent Grayskull e150 is an accelerator designed for machine learning workloads. Its architecture provides a large SRAM that is distributed across a grid of cores, which makes it possible to keep intermediate results on chip and reduce the need for off-chip memory access.

The central contribution is a fused kernel for Grayskull that exclusively uses this SRAM and combines three operations: the matrix multiplication of queries and keys, the scaling of the resulting attention scores, and the Softmax that turns scores into attention weights. The paper also presents a dedicated Softmax kernel that uses the SRAM, and a CPU implementation that serves as a baseline for comparison.
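
To make the three steps concrete, here is a minimal pure-Python sketch of computing attention weights from queries and keys, with the matrix multiplication, scaling, and Softmax done together for each row. It is an illustration only, not the paper's Grayskull kernel: the function name `attention_weights` is hypothetical, the scaling factor 1/sqrt(d) is the standard scaled dot-product choice and is assumed here, and nothing in it models SRAM or the core grid.

```python
import math


def attention_weights(queries, keys):
    """Return softmax(Q K^T / sqrt(d)) as a list of rows.

    queries: list of n_q vectors, each of length d
    keys:    list of n_k vectors, each of length d
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)  # assumed: standard scaled dot-product factor
    weights = []
    for q in queries:
        # Matrix multiplication and scaling: one scaled row of Q K^T.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Softmax: subtract the row maximum first for numerical stability.
        row_max = max(scores)
        exps = [math.exp(s - row_max) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Tiny example: 2 queries, 3 keys, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = attention_weights(Q, K)
```

Each row of `W` sums to 1, and the full result has one entry per query-key pair, which is where the quadratic cost in sequence length comes from when queries and keys come from the same sequence.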

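The measurements show that the Softmax operation consumes most of the runtime when attention weights are computed from queries and keys on Grayskull. The dedicated Softmax kernel is up to 10 times faster than the CPU implementation, and the Softmax implementation inside the fused kernel is approximately 1.8 times faster than the dedicated Softmax kernel. The time and memory complexity of all implementations is quadratic in sequence length. The paper also notes that, at the time of writing, the Grayskull e150 is approximately 30 times cheaper for the general public than an Nvidia H100 PCIe and offers approximately 1.5 times more SRAM.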
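
The paper's abstract states that time and memory complexity are quadratic in sequence length, which matters for any approach that wants to keep attention scores in on-chip memory. The following back-of-the-envelope sketch shows how quickly a full score matrix grows. The helper names, the default of 2 bytes per element (a 16-bit format), and the idea of a single memory budget are all assumptions made for illustration; no figure here is taken from the paper.

```python
import math


def score_matrix_bytes(seq_len, bytes_per_element=2, num_heads=1):
    """Bytes needed to hold a full seq_len x seq_len score matrix per head."""
    return num_heads * seq_len * seq_len * bytes_per_element


def max_seq_len_for_budget(budget_bytes, bytes_per_element=2, num_heads=1):
    """Largest seq_len whose full score matrices fit in budget_bytes."""
    return math.isqrt(budget_bytes // (bytes_per_element * num_heads))


# Doubling the sequence length quadruples the memory for the scores.
for n in (1024, 2048, 4096):
    print(n, score_matrix_bytes(n))
```

Because the footprint grows with the square of the sequence length, a fixed on-chip budget caps the sequence length at roughly the square root of the budget divided by the element size.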

These findings are useful for researchers and engineers who deploy attention-based models on accelerators with large on-chip memory. Knowing that Softmax dominates the runtime, and that fusing it with the matrix multiplication and scaling steps pays off, helps them decide where to spend optimization effort on chips like the Tenstorrent Grayskull e150.

Critical Analysis

The paper provides a focused study of computing attention weights on the Tenstorrent Grayskull e150. Comparing the fused kernel against both a dedicated Softmax kernel and a CPU baseline makes it clear how much of the gain comes from fusion and how much from moving Softmax onto the accelerator, which is a valuable contribution to the field.

One potential limitation of the study is that it focuses solely on the e150, so the findings may not be directly applicable to other hardware platforms. All implementations also remain quadratic in sequence length for both time and memory, which limits how far an SRAM-only approach can scale. It would be interesting to see the analysis extended to a broader range of hardware or to other types of specialized chips.

Additionally, this summary's source material does not go deeply into the real-world applications of the findings. While the technical results are clearly stated, a more in-depth discussion of how practitioners could use these kernels when deploying attention-based models would be beneficial.

Overall, this paper offers a valuable contribution to the ongoing research on attention mechanisms and their implementation on specialized hardware. The findings provide a solid foundation for further exploration and optimization of attention-based models in the context of machine learning on edge devices and other resource-constrained systems.

Conclusion

This paper presents SRAM-based implementations of the attention-weight computation on the Tenstorrent Grayskull e150, an accelerator for machine learning workloads. A fused kernel combining matrix multiplication, score scaling, and Softmax is compared with a dedicated Softmax kernel and a CPU baseline.

The results show that Softmax dominates the runtime, that the dedicated Softmax kernel is up to 10 times faster than the CPU implementation, and that the Softmax inside the fused kernel is approximately 1.8 times faster than the dedicated kernel. Engineers optimizing attention for chips like the Grayskull e150 can use these numbers to judge where fusion and on-chip memory help most.

The findings from this paper contribute to the ongoing efforts to enhance the efficiency and deployment of attention-based models in a wide range of applications, from edge devices to high-performance computing systems. As the demand for powerful yet energy-efficient machine learning continues to grow, research like this will be crucial in enabling the next generation of intelligent, hardware-aware systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Attention in SRAM on Tenstorrent Grayskull

Moritz Thuning

When implementations of the Transformer's self-attention layer utilize SRAM instead of DRAM, they can achieve significant speedups. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull, that exclusively utilizes its large SRAM by combining matrix multiplication, attention score scaling and Softmax operations. Additionally, a dedicated Softmax kernel utilizing the SRAM and a CPU implementation serving as a baseline are presented. The Softmax operation consumes most of the runtime in the computation of attention weights from queries and keys on Grayskull. The speedup of the dedicated Softmax kernel compared to the CPU implementation is up to $10\times$, and the Softmax implementation inside the fused kernel is approximately $1.8\times$ faster than the dedicated Softmax kernel. The time and memory complexity of all implementations is quadratic in sequence length. Currently, the Grayskull e150 is approximately $30\times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5\times$ more SRAM.

7/22/2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0$\times$ with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6$\times$ lower numerical error than a baseline FP8 attention.

7/16/2024

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Ruhle, Saravan Rajmohan

Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has increased steadily reaching billions of parameters. These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators, such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in terms of the total context length, i.e., prompt and output tokens. Thus, several optimizations such as key-value tensor caching and FlashAttention computation have been proposed to deliver the low latency demands of applications relying on such large models. However, these techniques do not cater to the computationally distinct nature of different phases during inference. To that end, we propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase (decode-phase) of decoder-only transformer models. LeanAttention enables scaling the attention mechanism implementation for the challenging case of long context lengths by re-designing the execution flow for the decode-phase. We identify that the associative property of online softmax can be treated as a reduction operation thus allowing us to parallelize the attention computation over these large context lengths. We extend the stream-K style reduction of tiled calculation to self-attention to enable parallel computation resulting in an average of 2.6x attention execution speedup over FlashAttention-2 and up to 8.33x speedup for 512k context lengths.

5/20/2024

An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing

Ashkan Moradifirouzabadi, Divya Sri Dodla, Mingu Kang

The attention mechanism is a key computing kernel of Transformers, calculating pairwise correlations across the entire input sequence. The computing complexity and frequent memory access in computing self-attention put a huge burden on the system especially when the sequence length increases. This paper presents an analog and digital hybrid processor to accelerate the attention mechanism for transformers in 65nm CMOS technology. We propose an analog computing-in-memory (CIM) core, which prunes ~75% of low-score tokens on average during runtime at ultra-low power and delay. Additionally, a digital processor performs precise computations only for ~25% unpruned tokens selected by the analog CIM core, preventing accuracy degradation. Measured results show peak energy efficiency of 14.8 and 1.65 TOPS/W, and peak area efficiency of 976.6 and 79.4 GOPS/mm$^2$ in the analog core and the system-on-chip (SoC), respectively.

9/10/2024