The Feasibility of Implementing Large-Scale Transformers on Multi-FPGA Platforms

arXiv:2404.16158

Published 4/26/2024 by Yu Gao, Juan Camilo Vega, Paul Chow

Abstract

FPGAs are rarely mentioned when discussing the implementation of large machine learning applications, such as Large Language Models (LLMs), in the data center. There has been much evidence showing that single FPGAs can be competitive with GPUs in performance for some computations, especially for low latency, and often much more efficient when power is considered. This suggests that there is merit to exploring the use of multiple FPGAs for large machine learning applications. The challenge with using multiple FPGAs is that there is no commonly-accepted flow for developing and deploying multi-FPGA applications, i.e., there are no tools to describe a large application, map it to multiple FPGAs and then deploy the application on a multi-FPGA platform. In this paper, we explore the feasibility of implementing large transformers using multiple FPGAs by developing a scalable multi-FPGA platform and some tools to map large applications to the platform. We validate our approach by designing an efficient multi-FPGA version of the I-BERT transformer and implement one encoder using six FPGAs as a working proof-of-concept to show that our platform and tools work. Based on our proof-of-concept prototype and the estimations of performance using the latest FPGAs compared to GPUs, we conclude that there can be a place for FPGAs in the world of large machine learning applications. We demonstrate a promising first step that shows that with the right infrastructure and tools it is reasonable to continue to explore the possible benefits of using FPGAs for applications such as LLMs.

Overview

FPGAs (Field Programmable Gate Arrays) are rarely discussed as an implementation platform for large machine learning applications like Large Language Models (LLMs) in data centers.
However, the paper cites evidence that a single FPGA can be competitive with a GPU (Graphics Processing Unit) for some computations, especially latency-sensitive ones, and is often much more efficient once power is taken into account.
This suggests that using multiple FPGAs could be a viable approach for implementing large machine learning applications.
The challenge is that there are currently no commonly accepted tools or workflows for developing and deploying multi-FPGA applications.

Plain English Explanation

FPGAs are chips whose internal logic can be reprogrammed after manufacturing to perform specific tasks. When it comes to running large machine learning models, like those used for natural language processing, GPUs are usually the go-to choice in data centers. However, the authors point to evidence that a single FPGA can be competitive with a GPU on some computations, especially where low latency matters, and is often much more efficient once power is considered.

This means that using multiple FPGAs together could be a good way to run these large machine learning models. The problem is that there aren't any established tools or processes for developing and deploying applications that use multiple FPGAs working together. Researchers need to figure out how to describe a large application, split it across multiple FPGAs, and then actually get it running on a multi-FPGA system.

Technical Explanation

In this paper, the researchers explore the feasibility of using multiple FPGAs to implement large transformer models, which are a key component of many state-of-the-art language models. They developed a scalable multi-FPGA platform and some tools to help map large applications across multiple FPGAs.
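
One way to picture the mapping step is as a pipeline-partitioning problem. The sketch below is a hypothetical illustration, not the authors' tool: it assumes the encoder is a linear chain of kernels, that each kernel has a single scalar resource cost, and that each FPGA receives a contiguous run of kernels. The kernel names and cost values are invented placeholders. The function searches for the smallest per-device budget under which the chain fits on the available devices.

```python
from typing import List, Sequence, Tuple

# Hypothetical kernel chain for one encoder; names and costs are placeholders,
# not measurements or structure taken from the paper.
KERNELS = ["qkv_proj", "attn_scores", "softmax", "attn_context", "attn_out",
           "layernorm_1", "ffn_up", "gelu", "ffn_down", "layernorm_2"]
COSTS = [6, 4, 2, 4, 3, 1, 8, 2, 8, 1]


def groups_needed(costs: Sequence[int], limit: int) -> int:
    """Count the contiguous groups a greedy fill needs when no group may exceed limit."""
    count, running = 1, 0
    for cost in costs:
        if running + cost > limit:
            count += 1
            running = cost
        else:
            running += cost
    return count


def partition_pipeline(costs: Sequence[int], num_devices: int) -> List[Tuple[int, int]]:
    """Split a chain of kernels into at most num_devices contiguous groups
    so that the most heavily loaded device carries as little as possible.

    Returns half-open (start, end) index ranges, one per device used.
    """
    if not costs or num_devices < 1:
        raise ValueError("need at least one kernel and one device")

    # Binary search for the smallest per-device budget that still fits.
    low, high = max(costs), sum(costs)
    while low < high:
        mid = (low + high) // 2
        if groups_needed(costs, mid) <= num_devices:
            high = mid
        else:
            low = mid + 1

    # Rebuild the grouping for the budget that was found.
    ranges, start, running = [], 0, 0
    for index, cost in enumerate(costs):
        if running + cost > low:
            ranges.append((start, index))
            start, running = index, cost
        else:
            running += cost
    ranges.append((start, len(costs)))
    return ranges


if __name__ == "__main__":
    for device, (start, end) in enumerate(partition_pipeline(COSTS, 6)):
        print(f"FPGA {device}: {KERNELS[start:end]} (load {sum(COSTS[start:end])})")
```

A real flow would also have to account for several resource types (LUTs, DSP slices, on-chip memory) and for the bandwidth of the links between devices, all of which this sketch ignores.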

As a proof of concept, the researchers designed a multi-FPGA version of the I-BERT transformer model and implemented one of its encoders on six FPGAs. This working prototype shows that the platform and tools can distribute at least one transformer encoder across multiple devices.
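
I-BERT is an integer-only quantized variant of BERT, which is a plausible reason it maps well to FPGA fabric. The sketch below shows the general idea of an integer-only linear layer: small integer operands, a wide integer accumulator, and rescaling by an integer multiplier and bit shift instead of a floating-point scale. It is a generic illustration with made-up function names and values, not the kernels from the paper.

```python
from typing import List, Sequence

INT8_MAX = 127  # symmetric 8-bit range [-127, 127]


def clamp_int8(value: int) -> int:
    """Saturate an integer to the symmetric 8-bit range."""
    return max(-INT8_MAX, min(INT8_MAX, value))


def int_matvec(weights_q: Sequence[Sequence[int]], x_q: Sequence[int]) -> List[int]:
    """Matrix-vector product on quantized integers.

    Python integers are unbounded; in hardware the accumulator would be a
    wider fixed-width register (an assumption of this sketch).
    """
    return [sum(w * x for w, x in zip(row, x_q)) for row in weights_q]


def requantize(acc: Sequence[int], multiplier: int, shift: int) -> List[int]:
    """Rescale accumulators by multiplier / 2**shift using integers only.

    Adding half of 2**shift before the shift rounds to nearest.
    Requires shift >= 1.
    """
    half = 1 << (shift - 1)
    return [clamp_int8((a * multiplier + half) >> shift) for a in acc]


if __name__ == "__main__":
    # Placeholder weights, activations, and scale; 77 / 2**10 is roughly 0.075.
    accumulators = int_matvec([[10, -20], [30, 40]], [5, 6])
    print(requantize(accumulators, multiplier=77, shift=10))
```

Keeping every step in integer arithmetic avoids floating-point units entirely, which is the property that makes this style of model attractive for custom hardware.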

Based on this prototype and estimates of FPGA performance compared to GPUs, the researchers believe there is potential for FPGAs to play a role in running large machine learning applications, like LLMs. However, more work is needed to develop the necessary infrastructure and tools to make it practical.

Critical Analysis

The paper presents a promising first step towards using multiple FPGAs for large machine learning models, but there are still some challenges to overcome. The researchers acknowledge that their multi-FPGA platform and tools are still prototypes, and more development is needed to make them production-ready.

Additionally, the performance comparisons between FPGAs and GPUs are based on estimates rather than comprehensive benchmarks. Further research is needed to fully understand the relative strengths and weaknesses of each hardware platform for different machine learning workloads.

There are also open questions about how well the approach scales. The proof of concept used six FPGAs for a single encoder, and it is unclear how well the platform and tools would handle larger numbers of FPGAs or more complex models. Model compression techniques such as distillation may also be needed to deploy transformer-based models on FPGA hardware.

Conclusion

This paper presents a first step toward implementing large transformers, the building blocks of many state-of-the-art language models, on multiple FPGAs. The six-FPGA proof of concept, together with the authors' performance estimates for newer FPGAs, suggests that FPGAs could play a role in large-scale machine learning applications, potentially with advantages in latency and power efficiency, provided that the supporting infrastructure and tools continue to mature.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference

Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang

Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. The majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. Through our analysis, we can determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4x speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2x speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9x speedup and a 5.7x improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.

4/9/2024

cs.LG cs.AI cs.AR cs.CL

HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis

Andy He, Darren Key, Mason Bulling, Andrew Chang, Skyler Shapiro, Everett Lee

Graphics Processing Units (GPUs) have become the leading hardware accelerator for deep learning applications and are used widely in training and inference of transformers; transformers have achieved state-of-the-art performance in many areas of machine learning and are especially used in most modern Large Language Models (LLMs). However, GPUs require large amounts of energy, which poses environmental concerns, demands high operational costs, and causes GPUs to be unsuitable for edge computing. We develop an accelerator for transformers, namely, Llama 2, an open-source state-of-the-art LLM, using high level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs). HLS allows us to rapidly prototype FPGA designs without writing code at the register-transfer level (RTL). We name our method HLSTransform, and the FPGA designs we synthesize with HLS achieve up to a 12.75x reduction and 8.25x reduction in energy used per token on the Xilinx Virtex UltraScale+ VU9P FPGA compared to an Intel Xeon Broadwell E5-2686 v4 CPU and NVIDIA RTX 3090 GPU respectively, while increasing inference speeds by up to 2.46x compared to CPU and maintaining 0.53x the speed of an RTX 3090 GPU despite the GPU's 4 times higher base clock rate. With the lack of existing open-source FPGA accelerators for transformers, we open-source our code and document our steps for synthesis. We hope this work will serve as a step in democratizing the use of FPGAs in transformer inference and inspire research into energy-efficient inference methods as a whole. The code can be found on https://github.com/HLSTransform/submission.

5/3/2024

cs.AR cs.AI cs.LG

Investigating Resource-efficient Neutron/Gamma Classification ML Models Targeting eFPGAs

Jyothisraj Johnson, Billy Boxer, Tarun Prakash, Carl Grace, Peter Sorensen, Mani Tripathi

There has been considerable interest and resulting progress in implementing machine learning (ML) models in hardware over the last several years from the particle and nuclear physics communities. A big driver has been the release of the Python package, hls4ml, which has enabled porting models specified and trained using Python ML libraries to register transfer level (RTL) code. So far, the primary end targets have been commercial FPGAs or synthesized custom blocks on ASICs. However, recent developments in open-source embedded FPGA (eFPGA) frameworks now provide an alternate, more flexible pathway for implementing ML models in hardware. These customized eFPGA fabrics can be integrated as part of an overall chip design. In general, the decision between a fully custom, eFPGA, or commercial FPGA ML implementation will depend on the details of the end-use application. In this work, we explored the parameter space for eFPGA implementations of fully-connected neural network (fcNN) and boosted decision tree (BDT) models using the task of neutron/gamma classification with a specific focus on resource efficiency. We used data collected using an AmBe sealed source incident on Stilbene, which was optically coupled to an OnSemi J-series SiPM to generate training and test data for this study. We investigated relevant input features and the effects of bit-resolution and sampling rate as well as trade-offs in hyperparameters for both ML architectures while tracking total resource usage. The performance metric used to track model performance was the calculated neutron efficiency at a gamma leakage of 10$^{-3}$. The results of the study will be used to aid the specification of an eFPGA fabric, which will be integrated as part of a test chip.

4/24/2024

cs.LG

A Survey on Large Language Models from Concept to Implementation

Chen Wang, Jin Zhao, Jiaqi Gong

Recent advancements in Large Language Models (LLMs), particularly those built on Transformer architectures, have significantly broadened the scope of natural language processing (NLP) applications, transcending their initial use in chatbot technology. This paper investigates the multifaceted applications of these models, with an emphasis on the GPT series. This exploration focuses on the transformative impact of artificial intelligence (AI) driven tools in revolutionizing traditional tasks like coding and problem-solving, while also paving new paths in research and development across diverse industries. From code interpretation and image captioning to facilitating the construction of interactive systems and advancing computational domains, Transformer models exemplify a synergy of deep learning, data analysis, and neural network design. This survey provides an in-depth look at the latest research in Transformer models, highlighting their versatility and the potential they hold for transforming diverse application sectors, thereby offering readers a comprehensive understanding of the current and future landscape of Transformer-based LLMs in practical applications.

5/29/2024

cs.CL cs.AI cs.IT cs.LG