H2PIPE: High throughput CNN Inference on FPGAs with High-Bandwidth Memory

Read original: arXiv:2408.09209 - Published 8/20/2024 by Mario Doumet, Marius Stan, Mathew Hall, Vaughn Betz

Overview

Presents H2PIPE, a high-throughput CNN inference architecture for FPGAs with high-bandwidth memory
Achieves up to 11.7 TOPS throughput while maintaining energy efficiency
Demonstrates performance improvements over state-of-the-art FPGA and GPU inference accelerators

Plain English Explanation

The paper introduces H2PIPE, a new system designed to run convolutional neural network (CNN) models quickly and efficiently on field-programmable gate arrays (FPGAs) equipped with high-bandwidth memory. CNNs are a type of machine learning model commonly used for image recognition and other tasks.

The key innovation in H2PIPE is that it stores CNN weights in both on-chip memory and off-chip High-Bandwidth Memory (HBM), so networks too large to fit on chip can still run on a layer-pipelined accelerator. A hardware architecture built around the high memory bandwidth of modern FPGA systems lets it reach a throughput of up to 11.7 trillion operations per second (TOPS).

By optimizing the CNN inference process for FPGAs, H2PIPE is able to outperform both state-of-the-art FPGA and GPU-based inference accelerators in terms of throughput, while also maintaining good energy efficiency. This makes it a promising solution for applications that require fast and power-efficient CNN inference, such as autonomous vehicles, real-time image processing, and edge computing.

Technical Explanation

The paper introduces the H2PIPE architecture, which is designed to accelerate CNN inference on FPGAs equipped with high-bandwidth memory. The key components of the H2PIPE architecture include:

Layer-Pipelined Parallelism: H2PIPE builds a different processing element for each CNN layer and runs all layers in parallel as one pipeline, which yields high compute density but requires high memory bandwidth.
Memory-Centric Design: Weights are split between on-chip storage and High-Bandwidth Memory (HBM). Based on profiling of HBM latency and throughput, an algorithm chooses which weight buffers move off chip and how deep the on-chip FIFOs to HBM should be, to minimize compute unit stalling.
Dataflow-based Execution: H2PIPE extends the HPIPE dataflow accelerator and its domain-specific CNN compiler, which generates the layer-specific hardware for an entire network.
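
The pipelined execution described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the `Layer` fields, the cycle counts, the byte counts, and the 400 MHz clock are all hypothetical values chosen for illustration. The sketch shows two general properties of a layer pipeline: steady-state throughput is set by the slowest stage, and because every layer runs at the same time, the weight bandwidth demands of all layers add up.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    cycles_per_image: int        # hypothetical: cycles this layer's hardware needs per image
    weight_bytes_per_image: int  # hypothetical: weight bytes the layer streams per image


def pipeline_images_per_second(layers, clock_hz):
    """Steady-state throughput of a layer pipeline is limited by its slowest stage."""
    bottleneck = max(layer.cycles_per_image for layer in layers)
    return clock_hz / bottleneck


def total_weight_bandwidth(layers, clock_hz):
    """All layers run concurrently, so their weight traffic adds up (bytes per second)."""
    ips = pipeline_images_per_second(layers, clock_hz)
    return ips * sum(layer.weight_bytes_per_image for layer in layers)


# Made-up three-layer network, for illustration only.
layers = [
    Layer("conv1", 100_000, 10_000),
    Layer("conv2", 200_000, 40_000),
    Layer("fc", 50_000, 50_000),
]
ips = pipeline_images_per_second(layers, clock_hz=400_000_000)
bandwidth = total_weight_bandwidth(layers, clock_hz=400_000_000)
print(f"{ips:.0f} images/s, {bandwidth / 1e6:.0f} MB/s of weight traffic")
```

In this toy model, speeding up the bottleneck layer raises throughput and raises the aggregate weight bandwidth with it, which is why a faster pipeline needs more memory bandwidth.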

The paper evaluates H2PIPE on ResNet-18, ResNet-50 and VGG-16 and compares it with the best prior work. It reports throughput of up to 11.7 TOPS and speed-ups of at least 19.4x, 5.1x and 10.5x on the three networks respectively.

Critical Analysis

The paper explains the H2PIPE architecture and its key innovations in detail, and the design is clearly tailored to the high-bandwidth memory available on recent FPGAs.

One potential area of concern is the scalability of the H2PIPE approach. The paper focuses on a specific set of CNN models and FPGA hardware, and it's unclear how well the architecture would generalize to a wider range of models or future hardware advancements. Additionally, the paper does not address potential issues with the complexity or programmability of the H2PIPE design, which could be a barrier to adoption in real-world scenarios.

Further research could explore the applicability of the H2PIPE approach to a broader range of CNN models, as well as its performance on more recent FPGA hardware with even higher memory bandwidth. Investigating the ease of use and integration of the H2PIPE architecture into existing deep learning frameworks and workflows would also be valuable.

Conclusion

The H2PIPE architecture presented in this paper represents a significant advancement in the field of high-throughput CNN inference on FPGAs. By leveraging the high-bandwidth memory capabilities of modern FPGA systems, the authors have been able to achieve impressive throughput and energy efficiency, outperforming state-of-the-art FPGA and GPU-based inference accelerators.

This work has the potential to enable a new class of real-time, power-efficient applications that rely on CNN-based computer vision, such as autonomous vehicles, robotics, and edge computing. Further research and development of the H2PIPE approach could lead to even more widespread adoption and impact in the field of deep learning hardware acceleration.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

H2PIPE: High throughput CNN Inference on FPGAs with High-Bandwidth Memory

Mario Doumet, Marius Stan, Mathew Hall, Vaughn Betz

Convolutional Neural Networks (CNNs) combine large amounts of parallelizable computation with frequent memory access. Field Programmable Gate Arrays (FPGAs) can achieve low latency and high throughput CNN inference by implementing dataflow accelerators that pipeline layer-specific hardware to implement an entire network. By implementing a different processing element for each CNN layer, these layer-pipelined accelerators can achieve high compute density, but having all layers processing in parallel requires high memory bandwidth. Traditionally this has been satisfied by storing all weights on chip, but this is infeasible for the largest CNNs, which are often those most in need of acceleration. In this work we augment a state-of-the-art dataflow accelerator (HPIPE) to leverage both High-Bandwidth Memory (HBM) and on-chip storage, enabling high performance layer-pipelined dataflow acceleration of large CNNs. Based on profiling results of HBM's latency and throughput against expected address patterns, we develop an algorithm to choose which weight buffers should be moved off chip and how deep the on-chip FIFOs to HBM should be to minimize compute unit stalling. We integrate the new hardware generation within the HPIPE domain-specific CNN compiler and demonstrate good bandwidth efficiency against theoretical limits. Compared to the best prior work we obtain speed-ups of at least 19.4x, 5.1x and 10.5x on ResNet-18, ResNet-50 and VGG-16 respectively.
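
The abstract does not spell out how weight buffers are selected or how FIFO depths are computed, so the following Python sketch is only an illustration under stated assumptions, not the authors' algorithm. It assumes a hypothetical greedy policy that offloads the buffers freeing the most on-chip memory per unit of HBM bandwidth until the design fits. It also assumes a simple rule that sizes each FIFO to cover one HBM round trip at the layer's read rate. Every name, parameter, and number here is made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class WeightBuffer:
    layer: str
    size_bytes: int          # on-chip memory the buffer would occupy
    read_bytes_per_s: float  # bandwidth needed to keep the layer fed (must be > 0)


def choose_offchip(buffers, onchip_budget_bytes, hbm_bandwidth_bytes_per_s):
    """Hypothetical greedy policy: offload buffers until the rest fits on chip.

    Buffers that free the most memory per byte/s of HBM traffic go first.
    Raises ValueError if the design cannot fit within both budgets.
    """
    onchip_used = sum(b.size_bytes for b in buffers)
    hbm_used = 0.0
    offchip = []
    ranked = sorted(buffers, key=lambda b: b.size_bytes / b.read_bytes_per_s, reverse=True)
    for b in ranked:
        if onchip_used <= onchip_budget_bytes:
            break  # everything left fits on chip
        if hbm_used + b.read_bytes_per_s > hbm_bandwidth_bytes_per_s:
            continue  # this buffer would oversubscribe HBM; try the next one
        offchip.append(b.layer)
        onchip_used -= b.size_bytes
        hbm_used += b.read_bytes_per_s
    if onchip_used > onchip_budget_bytes:
        raise ValueError("weights do not fit within the memory and bandwidth budgets")
    return offchip


def fifo_depth_words(latency_s, read_bytes_per_s, word_bytes):
    """Words needed to cover one HBM round trip at the layer's read rate (assumed rule)."""
    return math.ceil(latency_s * read_bytes_per_s / word_bytes)


# Made-up buffers and budgets, for illustration only.
buffers = [
    WeightBuffer("conv1", 1_000_000, 4e9),
    WeightBuffer("conv5", 8_000_000, 2e9),
    WeightBuffer("fc", 4_000_000, 0.5e9),
]
moved = choose_offchip(buffers, onchip_budget_bytes=6_000_000, hbm_bandwidth_bytes_per_s=3e9)
depth = fifo_depth_words(latency_s=500e-9, read_bytes_per_s=2e9, word_bytes=32)
print(moved, depth)
```

A real flow would replace the fixed latency and bandwidth numbers with profiled HBM measurements for the expected address patterns, as the abstract describes.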

8/20/2024

HG-PIPE: Vision Transformer Acceleration with Hybrid-Grained Pipeline

Qingyu Guo, Jiayong Wan, Songqiang Xu, Meng Li, Yuan Wang

Vision Transformer (ViT) acceleration with field programmable gate array (FPGA) is promising but challenging. Existing FPGA-based ViT accelerators mainly rely on temporal architectures, which process different operators by reusing the same hardware blocks and suffer from extensive memory access overhead. Pipelined architectures, either coarse-grained or fine-grained, unroll the ViT computation spatially for memory access efficiency. However, they usually suffer from significant hardware resource constraints and pipeline bubbles induced by the global computation dependency of ViT. In this paper, we introduce HG-PIPE, a pipelined FPGA accelerator for high-throughput and low-latency ViT processing. HG-PIPE features a hybrid-grained pipeline architecture to reduce on-chip buffer cost and couples the computation dataflow and parallelism design to eliminate the pipeline bubbles. HG-PIPE further introduces careful approximations to implement both linear and non-linear operators with abundant Lookup Tables (LUTs), thus alleviating resource constraints. On a ZCU102 FPGA, HG-PIPE achieves 2.78 times better throughput and 2.52 times better resource efficiency than the prior-art accelerators, e.g., AutoViTAcc. With a VCK190 FPGA, HG-PIPE realizes end-to-end ViT acceleration on a single device and achieves 7118 images/s, which is 2.81 times faster than a V100 GPU.

8/2/2024

Efficient Edge AI: Deploying Convolutional Neural Networks on FPGA with the Gemmini Accelerator

Federico Nicolas Peccia, Svetlana Pavlitska, Tobias Fleck, Oliver Bringmann

The growing concerns regarding energy consumption and privacy have prompted the development of AI solutions deployable on the edge, circumventing the substantial CO2 emissions associated with cloud servers and mitigating risks related to sharing sensitive data. But deploying Convolutional Neural Networks (CNNs) on non-off-the-shelf edge devices remains a complex and labor-intensive task. In this paper, we present an end-to-end workflow for deployment of CNNs on Field Programmable Gate Arrays (FPGAs) using the Gemmini accelerator, which we modified for efficient implementation on FPGAs. We describe how we leverage open source software at each optimization step of the deployment process, the customizations we added to it, and their impact on the final system's performance. We were able to achieve real-time performance by deploying a YOLOv7 model on a Xilinx ZCU102 FPGA with an energy efficiency of 36.5 GOP/s/W. Our FPGA-based solution demonstrates superior power efficiency compared with other embedded hardware devices, and even outperforms other FPGA reference implementations. Finally, we present how this kind of solution can be integrated into a wider system, by testing our proposed platform in a traffic monitoring scenario.

8/15/2024

FPCA: Field-Programmable Pixel Convolutional Array for Extreme-Edge Intelligence

Zihan Yin, Akhilesh Jaiswal

The rapid advancement of neural network applications necessitates hardware that not only accelerates computation but also adapts efficiently to dynamic processing requirements. While processing-in-pixel has emerged as a promising solution to overcome the bottlenecks of traditional architectures at the extreme-edge, existing implementations face limitations in reconfigurability and scalability due to their static nature and inefficient area usage. Addressing these challenges, we present a novel architecture that significantly enhances the capabilities of processing-in-pixel for convolutional neural networks. Our design innovatively integrates non-volatile memory (NVM) with a novel unit pixel circuit design, enabling dynamic reconfiguration of synaptic weights, kernel size, channel size and stride size, thus offering unprecedented flexibility and adaptability. By using a separate die for pixel circuit and storing synaptic weights, our circuit achieves a substantial reduction in the required area per pixel, thereby increasing the density and scalability of the pixel array. Simulation results demonstrate dot product operations of the circuit and the non-linearity of its analog output, and a novel bucket-select curvefit model is proposed to capture it. This work not only addresses the limitations of current in-pixel computing approaches but also opens new avenues for developing more efficient, flexible, and scalable neural network hardware, paving the way for advanced AI applications.

8/21/2024