FLAASH: Flexible Accelerator Architecture for Sparse High-Order Tensor Contraction

Read original: arXiv:2404.16317 - Published 4/26/2024 by Gabriel Kulp, Andrew Ensinger, Lizhong Chen

Overview

This paper presents FLAASH, a flexible accelerator architecture for efficient computation of sparse high-order tensor contractions.
Tensor contractions are a fundamental operation in machine learning and scientific computing, but can be computationally expensive, especially for high-order tensors with sparsity.
FLAASH aims to accelerate sparse high-order tensor contractions by exploiting flexibility in the hardware architecture and runtime optimizations.

Plain English Explanation

FLAASH is a new type of computer chip or "accelerator" designed to speed up a key mathematical operation called tensor contraction. Tensor contraction is an important step in many machine learning and scientific computing algorithms, but it can be very computationally intensive, especially when the tensors (multi-dimensional arrays of data) are large and sparse (containing many zero values).
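
As a rough illustration of the operation itself (not of FLAASH's hardware), the sketch below contracts two small sparse tensors stored as coordinate dictionaries. The storage format, the function name `contract_last_first`, and the example data are assumptions made for illustration; the summary does not specify any of them.

```python
from collections import defaultdict


def contract_last_first(a, b):
    """Contract the last mode of sparse tensor `a` with the first mode of `b`.

    Both tensors are dicts mapping index tuples to nonzero values
    (an illustrative coordinate format, assumed here).
    """
    # Group b's nonzeros by their first index so that each nonzero of a
    # only visits the entries it can actually be multiplied with.
    b_by_first = defaultdict(list)
    for idx, val in b.items():
        b_by_first[idx[0]].append((idx[1:], val))

    out = defaultdict(float)
    for idx, a_val in a.items():
        free_a, k = idx[:-1], idx[-1]
        for free_b, b_val in b_by_first.get(k, ()):
            out[free_a + free_b] += a_val * b_val

    # Drop entries whose contributions cancelled to zero.
    return {idx: v for idx, v in out.items() if v != 0.0}


# An order-3 tensor (2 x 2 x 3) with three nonzeros, and a 3 x 2 matrix.
A = {(0, 0, 1): 2.0, (0, 1, 2): 3.0, (1, 0, 0): 4.0}
B = {(0, 1): 5.0, (1, 0): 1.0, (2, 0): 7.0}

C = contract_last_first(A, B)
print(C)  # {(0, 0, 0): 2.0, (0, 1, 0): 21.0, (1, 0, 1): 20.0}
```

Only three multiplications are performed here, whereas a dense version of the same contraction would perform 2 x 2 x 3 x 2 = 24. That gap is the kind of saving a sparsity-aware design tries to capture.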

The researchers behind FLAASH recognized this challenge and developed a flexible hardware architecture that can adapt to different tensor shapes and sparsity patterns. By being more flexible than traditional accelerators, FLAASH is able to take advantage of the sparsity in the data to perform tensor contractions much faster. This could lead to significant speedups for a wide range of applications, from training large deep learning models to accelerating scientific simulations.

Technical Explanation

The key innovations in FLAASH are its flexible datapath and its runtime optimizations for exploiting sparse tensor structure. According to the paper's abstract, the architecture performs a contraction by distributing sparse dot products, or portions of them, to numerous Sparse Dot Product Engines (SDPEs), and both the memory structure and the job distribution can be customized. It also employs techniques like hash-based tensor decomposition and dynamic scheduling to further optimize the computation.
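
The paper's abstract describes distributing sparse dot products to Sparse Dot Product Engines (SDPEs). A minimal software sketch of that idea follows, assuming each fiber along the contracted mode is an index-sorted list of `(index, value)` pairs and that jobs are dealt to engines round-robin. The names `sparse_dot` and `contract_with_engines`, the fiber format, and the round-robin policy are all hypothetical choices for illustration, not the authors' design.

```python
def sparse_dot(x, y):
    """Dot product of two sparse vectors given as index-sorted (index, value) lists."""
    i = j = 0
    total = 0.0
    # Walk both lists in step; multiply only where the indices match.
    while i < len(x) and j < len(y):
        xi, xv = x[i]
        yj, yv = y[j]
        if xi == yj:
            total += xv * yv
            i += 1
            j += 1
        elif xi < yj:
            i += 1
        else:
            j += 1
    return total


def contract_with_engines(left, right, num_engines=4):
    """Compute one sparse dot product per (left fiber, right fiber) pair.

    Jobs are dealt round-robin to `num_engines` queues, which stand in
    for SDPEs (an assumed, deliberately simple distribution policy).
    """
    jobs = [(l, r) for l in left for r in right]
    queues = [jobs[e::num_engines] for e in range(num_engines)]

    out = {}
    for queue in queues:  # in hardware these queues would drain concurrently
        for l, r in queue:
            val = sparse_dot(left[l], right[r])
            if val != 0.0:
                out[(l, r)] = val
    return out


# Fibers keyed by their free index; fiber 2 on the left is entirely zero.
left = {0: [(1, 2.0), (4, 1.0)], 1: [(0, 3.0)], 2: []}
right = {0: [(1, 5.0), (3, 2.0)], 1: [(0, 1.0), (4, 4.0)]}

result = contract_with_engines(left, right, num_engines=2)
# result == {(0, 0): 10.0, (0, 1): 4.0, (1, 1): 3.0}
```

Because every job is independent, the result does not depend on how many engines are used or on which engine receives which job; only the load balance changes.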

Through their evaluations, the authors report over 25x speedup for a deep learning workload, with the speedup growing as sparsity and tensor order increase. The architecture's flexibility allows it to adapt to a wide range of tensor shapes and sparsity patterns.

Critical Analysis

The paper provides a compelling technical solution to the challenge of accelerating sparse high-order tensor contractions. The flexible datapath and runtime optimizations appear well-designed and the experimental results are convincing.

However, the authors do not deeply discuss potential limitations or caveats of their approach. For example, it's unclear how the performance and energy efficiency of FLAASH would scale for truly massive tensor sizes that may exceed the on-chip memory capacity. Additionally, the impact of the additional hardware complexity on chip area, power, and cost is not thoroughly analyzed.
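
To make the memory-capacity concern concrete, here is a back-of-the-envelope estimate of tensor footprint. It assumes a plain coordinate (COO) layout with 4-byte indices and 4-byte values and a made-up example tensor; none of these figures come from the paper, and FLAASH's actual storage format may differ.

```python
def dense_bytes(shape, value_bytes=4):
    """Bytes needed to store every element of a dense tensor."""
    total = value_bytes
    for dim in shape:
        total *= dim
    return total


def coo_bytes(nnz, order, index_bytes=4, value_bytes=4):
    """Bytes for a coordinate layout: one index per mode plus one value per nonzero."""
    return nnz * (order * index_bytes + value_bytes)


# Hypothetical order-3 tensor, 1024 elements per mode, about 0.1% nonzero.
shape = (1024, 1024, 1024)
nnz = (1024 ** 3) // 1000

print(dense_bytes(shape))          # 4294967296 bytes (4 GiB)
print(coo_bytes(nnz, len(shape)))  # 17179856 bytes (about 16 MiB)
```

Even at this sparsity the compressed tensor is in the megabyte range, so whether it fits on chip depends on buffer sizes that the summary does not state.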

Further research could explore these practical deployment considerations and investigate how FLAASH's techniques could be extended to sparse tensor operations other than contraction.

Conclusion

Overall, the FLAASH architecture presents a promising approach to accelerating a critical computation for machine learning and scientific computing. By introducing flexibility into the hardware design, the authors have demonstrated significant performance gains for sparse high-order tensor contractions. This work could pave the way for more efficient large-scale tensor-based applications across a variety of domains.

Related Papers

FLAASH: Flexible Accelerator Architecture for Sparse High-Order Tensor Contraction

Gabriel Kulp, Andrew Ensinger, Lizhong Chen

Tensors play a vital role in machine learning (ML) and often exhibit properties best explored while maintaining high-order. Efficiently performing ML computations requires taking advantage of sparsity, but generalized hardware support is challenging. This paper introduces FLAASH, a flexible and modular accelerator design for sparse tensor contraction that achieves over 25x speedup for a deep learning workload. Our architecture performs sparse high-order tensor contraction by distributing sparse dot products, or portions thereof, to numerous Sparse Dot Product Engines (SDPEs). Memory structure and job distribution can be customized, and we demonstrate a simple approach as a proof of concept. We address the challenges associated with control flow to navigate data structures, high-order representation, and high-sparsity handling. The effectiveness of our approach is demonstrated through various evaluations, showcasing significant speedup as sparsity and order increase.

4/26/2024

HASS: Hardware-Aware Sparsity Search for Dataflow DNN Accelerator

Zhewen Yu, Sudarshan Sreeram, Krish Agrawal, Junyi Wu, Alexander Montgomerie-Corcoran, Cheng Zhang, Jianyi Cheng, Christos-Savvas Bouganis, Yiren Zhao

Deep Neural Networks (DNNs) excel in learning hierarchical representations from raw data, such as images, audio, and text. To compute these DNN models with high performance and energy efficiency, these models are usually deployed onto customized hardware accelerators. Among various accelerator designs, dataflow architecture has shown promising performance due to its layer-pipelined structure and its scalability in data parallelism. Exploiting weights and activations sparsity can further enhance memory storage and computation efficiency. However, existing approaches focus on exploiting sparsity in non-dataflow accelerators, which cannot be applied onto dataflow accelerators because of the large hardware design space introduced. As such, this could miss opportunities to find an optimal combination of sparsity features and hardware designs. In this paper, we propose a novel approach to exploit unstructured weights and activations sparsity for dataflow accelerators, using software and hardware co-optimization. We propose a Hardware-Aware Sparsity Search (HASS) to systematically determine an efficient sparsity solution for dataflow accelerators. Over a set of models, we achieve an efficiency improvement ranging from 1.3x to 4.2x compared to existing sparse designs, which are either non-dataflow or non-hardware-aware. Particularly, the throughput of MobileNetV3 can be optimized to 4895 images per second. HASS is open-source: https://github.com/Yu-Zhewen/HASS

6/6/2024

Compressing Structured Tensor Algebra

Mahdi Ghorbani, Emilien Bauer, Tobias Grosser, Amir Shaikhha

Tensor algebra is a crucial component for data-intensive workloads such as machine learning and scientific computing. As the complexity of data grows, scientists often encounter a dilemma between the highly specialized dense tensor algebra and efficient structure-aware algorithms provided by sparse tensor algebra. In this paper, we introduce DASTAC, a framework to propagate the tensor's captured high-level structure down to low-level code generation by incorporating techniques such as automatic data layout compression, polyhedral analysis, and affine code generation. Our methodology reduces memory footprint by automatically detecting the best data layout, heavily benefits from polyhedral optimizations, leverages further optimizations, and enables parallelization through MLIR. Through extensive experimentation, we show that DASTAC achieves 1 to 2 orders of magnitude speedup over TACO, a state-of-the-art sparse tensor compiler, and StructTensor, a state-of-the-art structured tensor algebra compiler, with a significantly lower memory footprint.

7/19/2024

LoAS: Fully Temporal-Parallel Dataflow for Dual-Sparse Spiking Neural Networks

Ruokai Yin, Youngeun Kim, Di Wu, Priyadarshini Panda

Spiking Neural Networks (SNNs) have gained significant research attention in the last decade due to their potential to drive resource-constrained edge devices. Though existing SNN accelerators offer high efficiency in processing sparse spikes with dense weights, opportunities are less explored in SNNs with sparse weights, i.e., dual-sparsity. In this work, we study the acceleration of dual-sparse SNNs, focusing on their core operation, sparse-matrix-sparse-matrix multiplication (spMspM). We observe that naively running a dual-sparse SNN on existing spMspM accelerators designed for dual-sparse Artificial Neural Networks (ANNs) exhibits sub-optimal efficiency. The main challenge is that processing timesteps, a natural property of SNNs, introduces an extra loop to ANN spMspM, leading to longer latency and more memory traffic. To address the problem, we propose a fully temporal-parallel (FTP) dataflow, which minimizes both data movement across timesteps and the end-to-end latency of dual-sparse SNNs. To maximize the efficiency of FTP dataflow, we propose an FTP-friendly spike compression mechanism that efficiently compresses single-bit spikes and ensures contiguous memory access. We further propose an FTP-friendly inner-join circuit that can lower the cost of the expensive prefix-sum circuits with almost no throughput penalty. All the above techniques for FTP dataflow are encapsulated in LoAS, a Low-latency inference Accelerator for dual-sparse SNNs. With FTP dataflow, compression, and inner-join, running dual-sparse SNN workloads on LoAS demonstrates significant speedup (up to 8.51x) and energy reduction (up to 3.68x) compared to running it on prior dual-sparse accelerators.

9/4/2024