Allo: A Programming Model for Composable Accelerator Design

2404.04815

Published 4/9/2024 by Hongzheng Chen, Niansong Zhang, Shaojie Xiang, Zhichen Zeng, Mengjia Dai, Zhiru Zhang


Abstract

Special-purpose hardware accelerators are increasingly pivotal for sustaining performance improvements in emerging applications, especially as the benefits of technology scaling continue to diminish. However, designers currently lack effective tools and methodologies to construct complex, high-performance accelerator architectures in a productive manner. Existing high-level synthesis (HLS) tools often require intrusive source-level changes to attain satisfactory quality of results. Despite the introduction of several new accelerator design languages (ADLs) aiming to enhance or replace HLS, their advantages are more evident in relatively simple applications with a single kernel. Existing ADLs prove less effective for realistic hierarchical designs with multiple kernels, even if the design hierarchy is flattened. In this paper, we introduce Allo, a composable programming model for efficient spatial accelerator design. Allo decouples hardware customizations, including compute, memory, communication, and data type from algorithm specification, and encapsulates them as a set of customization primitives. Allo preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner. This approach facilitates holistic optimizations that span across function boundaries. We conduct comprehensive experiments on commonly-used HLS benchmarks and several realistic deep learning models. Our evaluation shows that Allo can outperform state-of-the-art HLS tools and ADLs on all test cases in the PolyBench. For the GPT2 model, the inference latency of the Allo generated accelerator is 1.7x faster than the NVIDIA A100 GPU with 5.4x higher energy efficiency, demonstrating the capability of Allo to handle large-scale designs.


Overview

Special-purpose hardware accelerators are becoming increasingly important for improving performance in emerging applications.
However, designing complex, high-performance accelerator architectures is challenging with existing tools and methodologies.
Allo is a new programming model that aims to simplify the design of efficient spatial accelerators.

Plain English Explanation

Computers today rely heavily on specialized hardware called "accelerators" to handle complex tasks like artificial intelligence and scientific simulations. These accelerators are designed to be much faster and more efficient than general-purpose processors for specific applications. However, building these accelerators is very difficult using the tools currently available to engineers.

Allo is a new system that aims to make it easier to design powerful accelerators. It allows engineers to separate the high-level algorithm they want to run from the low-level details of how the hardware should be configured. This separation makes it possible to experiment with different hardware optimizations without rewriting the algorithm code itself.

The key innovation in Allo is its "composable" design: hardware customizations written for individual functions can be combined when those functions are assembled into a larger design. This allows for more holistic optimizations that span multiple parts of the accelerator, leading to better performance than what's possible with existing tools.

Technical Explanation

The Allo programming model decouples the specification of the algorithm from the hardware customizations required for efficient implementation. It encapsulates hardware customizations, including compute, memory, communication, and data types, as a set of composable primitives. Allo preserves the hierarchical structure of the input program and combines these customizations in a bottom-up, type-safe manner.
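To make the decoupling concrete, here is a minimal, self-contained Python sketch. It is not Allo's actual API: the `Schedule` class and its `split`, `reorder`, and `run` methods are hypothetical stand-ins chosen for illustration. The algorithm (a matrix multiply) is written once, and loop customizations are applied afterwards without touching the loop body.

```python
from itertools import product


class Schedule:
    """Toy customization handle (hypothetical; not Allo's actual API)."""

    def __init__(self, loops, body):
        self.loops = list(loops)  # [(name, extent), ...], outermost first
        self.body = body          # callable taking {index name: value}
        self.derived = []         # [(name, fn)] used to rebuild split indices

    def split(self, name, factor):
        """Replace loop `name` with an outer/inner pair of loops."""
        pos = [n for n, _ in self.loops].index(name)
        extent = self.loops[pos][1]
        if extent % factor != 0:
            raise ValueError("factor must divide the loop extent")
        outer, inner = name + "_o", name + "_i"
        self.loops[pos:pos + 1] = [(outer, extent // factor), (inner, factor)]
        self.derived.append(
            (name, lambda env, o=outer, i=inner, f=factor: env[o] * f + env[i])
        )
        return self

    def reorder(self, *names):
        """Permute the current loop nest into the given order."""
        extents = dict(self.loops)
        if sorted(names) != sorted(extents):
            raise ValueError("reorder needs a permutation of the current loops")
        self.loops = [(n, extents[n]) for n in names]
        return self

    def run(self):
        """Interpret the customized loop nest in software."""
        names = [n for n, _ in self.loops]
        for values in product(*(range(e) for _, e in self.loops)):
            env = dict(zip(names, values))
            # Rebuild original indices, most recent split first.
            for name, fn in reversed(self.derived):
                env[name] = fn(env)
            self.body(env)


def gemm(A, B):
    """Algorithm specification only: C[i][j] += A[i][k] * B[k][j]."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]

    def body(ix):
        C[ix["i"]][ix["j"]] += A[ix["i"]][ix["k"]] * B[ix["k"]][ix["j"]]

    return Schedule([("i", M), ("j", N), ("k", K)], body), C


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0, 2], [0, 1, 3]]

baseline, C_ref = gemm(A, B)
baseline.run()

# Customizations are applied separately from the algorithm above.
tiled, C_opt = gemm(A, B)
tiled.split("i", 2).reorder("i_o", "j", "i_i", "k")
tiled.run()

print(C_ref == C_opt)
```

The `gemm` body is identical in both runs; only the schedule differs. A real accelerator flow would lower the customized loop nest to hardware rather than interpret it, but the separation of concerns is the same idea.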

This approach facilitates holistic optimizations that span across function boundaries, as opposed to the intrusive source-level changes often required by existing high-level synthesis (HLS) tools. The authors evaluate Allo on commonly-used HLS benchmarks as well as realistic deep learning models like GPT-2.
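The bottom-up, type-safe composition described above can be pictured with another small Python sketch. Everything here is an illustrative assumption rather than Allo's real interface: `TensorType`, `Kernel`, `compose`, and the `partition` layout field are hypothetical names. Each kernel is customized on its own, and the parent design accepts the children only if the types at their interface agree.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TensorType:
    dtype: str
    shape: tuple
    partition: str = "none"  # hypothetical memory-layout annotation


@dataclass
class Kernel:
    name: str
    in_type: TensorType
    out_type: TensorType
    primitives: list = field(default_factory=list)

    def customize(self, primitive):
        """Record a customization applied to this kernel only."""
        self.primitives.append(f"{self.name}.{primitive}")
        return self


def compose(parent_name, stages):
    """Chain kernels bottom-up, checking each producer/consumer interface."""
    for producer, consumer in zip(stages, stages[1:]):
        if producer.out_type != consumer.in_type:
            raise TypeError(
                f"{producer.name} -> {consumer.name}: "
                f"{producer.out_type} does not match {consumer.in_type}"
            )
    merged = [p for stage in stages for p in stage.primitives]
    return Kernel(parent_name, stages[0].in_type, stages[-1].out_type, merged)


act = TensorType("int8", (64, 64), partition="cyclic(8)")
matmul_k = Kernel("matmul", act, act).customize("pipeline(j)")
relu_k = Kernel("relu", act, act).customize("unroll(i)")

# The parent inherits the children's customizations unchanged.
top = compose("top", [matmul_k, relu_k])
print(top.primitives)

# A layout mismatch at the interface is rejected before anything is built.
relu_bad = Kernel("relu", TensorType("int8", (64, 64)), act)
try:
    compose("bad", [matmul_k, relu_bad])
except TypeError as err:
    print("rejected:", err)
```

Treating the memory layout as part of the type is one way a mismatch between a producer and a consumer can be caught at composition time instead of surfacing later as a hardware bug.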

For the PolyBench suite of HLS benchmarks, Allo outperforms state-of-the-art HLS tools and accelerator design languages (ADLs) in all test cases. For the GPT-2 language model, the Allo-generated accelerator achieves 1.7x lower inference latency and 5.4x higher energy efficiency than an NVIDIA A100 GPU. These results suggest that Allo can handle large-scale, hierarchical accelerator designs.
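As a rough sanity check, the two GPT-2 figures can be combined into an implied power ratio. This is a back-of-the-envelope sketch: it assumes that both numbers refer to the same workload and that energy efficiency means work done per unit of energy, neither of which this summary states explicitly. Here $t$ is latency, $E$ is energy per inference, and $P = E / t$ is average power.

```latex
\frac{t_{\text{Allo}}}{t_{\text{A100}}} = \frac{1}{1.7},
\qquad
\frac{E_{\text{Allo}}}{E_{\text{A100}}} = \frac{1}{5.4}
\quad\Longrightarrow\quad
\frac{P_{\text{Allo}}}{P_{\text{A100}}}
  = \frac{E_{\text{Allo}} / t_{\text{Allo}}}{E_{\text{A100}} / t_{\text{A100}}}
  = \frac{1.7}{5.4} \approx 0.31
```

Under those assumptions, the accelerator would draw roughly a third of the GPU's average power while also finishing sooner.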

Critical Analysis

The paper provides a compelling case for the Allo programming model as a more productive approach to building efficient hardware accelerators. The authors thoroughly evaluate Allo against existing HLS tools and ADLs, showing significant performance improvements across a range of benchmarks and real-world applications.

However, the paper does not delve into potential limitations or caveats of the Allo approach. For example, it's unclear how Allo would scale to extremely large and complex accelerator designs, or how it would handle applications with highly irregular memory access patterns. Additionally, the paper does not discuss the effort required to define the various hardware customization primitives used by Allo.

Further research could explore the generalizability of the Allo approach, its applicability to a wider range of accelerator use cases, and the tradeoffs involved in adopting the Allo programming model compared to existing alternatives. A more critical analysis of the underlying assumptions and potential weaknesses of the Allo system would also help readers assess its merits and limitations more objectively.

Conclusion

The Allo programming model presented in this paper represents a promising approach to simplifying the design of efficient hardware accelerators. By decoupling algorithm specification from hardware customization, Allo enables more holistic optimizations and better performance compared to existing tools. The authors' comprehensive evaluation demonstrates Allo's capabilities for both common benchmarks and large-scale deep learning models.

While the paper does not address all potential limitations, Allo's strong performance results and innovative composable design suggest it could be a valuable addition to the accelerator design toolbox. Further research and real-world deployment of the Allo system will be needed to fully assess its impact and identify areas for improvement. Overall, this work highlights the potential of new programming models to unlock the power of specialized hardware and drive the next generation of high-performance computing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference

Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang

Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. The majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. Through our analysis, we can determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4x speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2x speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9x speedup and a 5.7x improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.

4/9/2024

cs.LG cs.AI cs.AR cs.CL

Cross-Modality Program Representation Learning for Electronic Design Automation with High-Level Synthesis

Zongyue Qin, Yunsheng Bai, Atefeh Sohrabizadeh, Zijian Ding, Ziniu Hu, Yizhou Sun, Jason Cong

In recent years, domain-specific accelerators (DSAs) have gained popularity for applications such as deep learning and autonomous driving. To facilitate DSA designs, programmers use high-level synthesis (HLS) to compile a high-level description written in C/C++ into a design with low-level hardware description languages that eventually synthesize DSAs on circuits. However, creating a high-quality HLS design still demands significant domain knowledge, particularly in microarchitecture decisions expressed as pragmas. Thus, it is desirable to automate such decisions with the help of machine learning for predicting the quality of HLS designs, requiring a deeper understanding of the program that consists of original code and pragmas. Naturally, these programs can be considered as sequence data. In addition, these programs can be compiled and converted into a control data flow graph (CDFG). But existing works either fail to leverage both modalities or combine the two in shallow or coarse ways. We propose ProgSG, a model that allows interaction between the source code sequence modality and the graph modality in a deep and fine-grained way. To alleviate the scarcity of labeled designs, a pre-training method is proposed based on a suite of compiler's data flow analysis tasks. Experimental results show that ProgSG reduces the RMSE of design performance predictions by up to 22%, and identifies designs with an average of 1.10× and 1.26× (up to 8.17× and 13.31×) performance improvement in design space exploration (DSE) task compared to HARP and AutoDSE, respectively.

7/1/2024

cs.LG cs.AI cs.AR


A Unified Programming Model for Heterogeneous Computing with CPU and Accelerator Technologies

Yuqing Xiong

This paper consists of three parts. The first part provides a unified programming model for heterogeneous computing with CPU and accelerator (like GPU, FPGA, Google TPU, Atos QPU, and more) technologies. To some extent, this new programming model turns programming across CPUs and accelerators into ordinary programming tasks in common programming languages, and reduces the complexity of programming across CPUs and accelerators. This can be achieved by extending file management in common programming languages, such as C/C++, Fortran, Python, MPI, etc., to cover accelerators as I/O devices. In the second part, we show that all types of computer systems can be reduced to the simplest type of computer system, a single-core CPU computer system with I/O devices, by the unified programming model. Thereby, the unified programming model can truly build the programming of various computer systems on one API (i.e. file management in common programming languages), and can make programming for various computer systems easier. In the third part, we present a new approach to coupled applications computing (like multidisciplinary simulations) by the unified programming model. The unified programming model makes coupled applications computing more natural and easier since it only relies on its own power to couple multiple applications through MPI.

5/31/2024

cs.DC

Fork is All You Needed in Heterogeneous Systems

Zixuan Wang, Jishen Zhao

We present a unified programming model for heterogeneous computing systems. Such systems integrate multiple computing accelerators and memory units to deliver higher performance than CPU-centric systems. Although heterogeneous systems have been adopted by modern workloads such as machine learning, programming remains a critical limiting factor. Conventional heterogeneous programming techniques either impose heavy modifications to the code base or require rewriting the program in a different language. Such programming complexity stems from the lack of a unified abstraction layer for computing and data exchange, which forces each programming model to define its abstractions. However, with the emerging cache-coherent interconnections such as Compute Express Link, we see an opportunity to standardize such architecture heterogeneity and provide a unified programming model. We present CodeFlow, a language runtime system for heterogeneous computing. CodeFlow abstracts architecture computation in programming language runtime and utilizes CXL as a unified data exchange protocol. Workloads written in high-level languages such as C++ and Rust can be compiled to CodeFlow, which schedules different parts of the workload to suitable accelerators without requiring the developer to implement code or call APIs for specific accelerators. CodeFlow reduces programmers' effort in utilizing heterogeneous systems and improves workload performance.

4/9/2024

cs.ET cs.DC