Full-stack evaluation of Machine Learning inference workloads for RISC-V systems

Read original: arXiv:2405.15380 - Published 5/27/2024 by Debjyoti Bhattacharjee, Anmol, Tommaso Marinelli, Karan Pathak, Peter Kourzanov

Overview

Evaluates the performance of machine learning inference workloads on RISC-V systems
Compares the performance of different RISC-V hardware configurations and software optimizations
Provides insights into the suitability of RISC-V for machine learning applications

Plain English Explanation

This paper examines how well RISC-V, an open-standard instruction set architecture (ISA) for processors, can handle machine learning tasks. Machine learning is a branch of artificial intelligence in which computers learn patterns from data rather than following explicitly programmed rules. The researchers tested different RISC-V hardware setups and software optimizations to see how they affected the performance of machine learning inference workloads, that is, the computation required to run an already trained model on new inputs.

The key findings are that RISC-V can be a viable platform for machine learning, but the performance depends on the specific hardware configuration and software optimizations used. The researchers provide insights into which RISC-V hardware and software configurations work best for different machine learning tasks. This information can help developers choose the right RISC-V system for their machine learning applications.

Technical Explanation

The paper evaluates the performance of machine learning inference workloads on RISC-V systems, using the open-source gem5 architectural simulator and an open-source compilation toolchain based on Multi-Level Intermediate Representation (MLIR). The researchers tested a variety of RISC-V hardware configurations, including different CPU core counts, cache sizes, and memory configurations. They also evaluated the impact of software optimizations, such as compiler optimizations and quantization techniques.
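To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, one common form of the technique. It is not taken from the paper: the function names and the example weights are hypothetical, and the paper's toolchain may quantize differently.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: returns (codes, scale)."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any positive scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range used here.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen for illustration only (not from the paper).
weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy cost paid for storing and computing with 8-bit integers instead of floating-point numbers.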

The results show that RISC-V can be a viable platform for machine learning inference, but the performance can vary significantly depending on the hardware and software configuration. The researchers found that increasing the number of CPU cores and cache size generally improved performance, but the optimal configuration depended on the specific machine learning workload.

The paper also explores the challenges of evaluating machine learning performance, including current limitations of the gem5 simulator when modeling RISC-V architectures, and provides insights into the tradeoffs between different RISC-V hardware and software configurations.

Critical Analysis

The paper provides a comprehensive evaluation of machine learning inference on RISC-V systems, but there are a few potential limitations and areas for further research:

The study focused on a limited set of machine learning workloads and hardware configurations. More extensive testing with a wider range of workloads and hardware setups would be beneficial to get a more complete understanding of RISC-V's capabilities.
The paper does not examine in depth the usability or performance of embedded development environments for RISC-V, which could be an important consideration for real-world deployment.
While the paper provides insights into the tradeoffs between hardware and software configurations, it does not offer specific guidance on how to optimize RISC-V systems for different machine learning use cases. Further research in this area could be valuable for developers.

Overall, the paper makes a valuable contribution to understanding the potential of RISC-V for machine learning applications, but additional research and evaluation would help strengthen the conclusions and provide more practical guidance for developers.

Conclusion

This paper presents a comprehensive evaluation of machine learning inference workloads on RISC-V systems. The researchers found that RISC-V can be a suitable platform for machine learning, but the performance depends heavily on the specific hardware and software configurations. The insights provided in this paper can help developers choose the right RISC-V system for their machine learning applications and identify areas for further optimization. As RISC-V continues to evolve, evaluations like this one could help inform the adoption of this open architecture for machine learning use cases.

Related Papers

Full-stack evaluation of Machine Learning inference workloads for RISC-V systems

Debjyoti Bhattacharjee, Anmol, Tommaso Marinelli, Karan Pathak, Peter Kourzanov

Architectural simulators hold a vital role in RISC-V research, providing a crucial platform for workload evaluation without the need for costly physical prototypes. They serve as a dynamic environment for exploring innovative architectural concepts, enabling swift iteration and thorough analysis of performance metrics. As deep learning algorithms become increasingly pervasive, it is essential to benchmark new architectures with machine learning workloads. The diverse computational kernels used in deep learning algorithms highlight the necessity for a comprehensive compilation toolchain to map to target hardware platforms. This study evaluates the performance of a wide array of machine learning workloads on RISC-V architectures using gem5, an open-source architectural simulator. Leveraging an open-source compilation toolchain based on Multi-Level Intermediate Representation (MLIR), the research presents benchmarking results specifically focused on deep learning inference workloads. Additionally, the study sheds light on current limitations of gem5 when simulating RISC-V architectures, offering insights for future development and refinement.

5/27/2024

Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

Viviane Potocnik, Luca Colagrande, Tim Fischer, Luca Bertaccini, Daniele Jahier Pagliari, Alessio Burrello, Luca Benini

Transformer-based foundation models have become crucial for various domains, most notably natural language processing (NLP) or computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform implementing distributed Softmax primitives and leveraging ISA extensions for SIMD floating-point operand streaming and instruction repetition, as well as specialized DMA engines to minimize costly main memory accesses and to tolerate their latency. We focus on two foundational transformer topologies, encoder-only and decoder-only models. For encoder-only models, we demonstrate a speedup of up to 12.8x between the most optimized implementation and the baseline version. We reach over 79% FPU utilization and 294 GFLOPS/W, outperforming State-of-the-Art (SoA) accelerators by more than 2x utilizing the HW platform while achieving comparable throughput per computational unit. For decoder-only topologies, we achieve 16.1x speedup in the Non-Autoregressive (NAR) mode and up to 35.6x speedup in the Autoregressive (AR) mode compared to the baseline implementation. Compared to the best SoA dedicated accelerator, we achieve 2.04x higher FPU utilization.

5/30/2024
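The abstract above mentions distributed Softmax primitives without describing them. As a rough illustration only, assuming "distributed" means each core handles one chunk of the input and the partial results are then combined, a numerically stable chunked softmax could be sketched as follows. The function names and the example logits are hypothetical and do not come from the paper.

```python
import math


def chunk_stats(chunk):
    """Per-worker step: local maximum and sum of exp(x - local_max)."""
    local_max = max(chunk)
    local_sum = sum(math.exp(x - local_max) for x in chunk)
    return local_max, local_sum


def distributed_softmax(chunks):
    """Combine per-chunk statistics into a numerically stable softmax."""
    stats = [chunk_stats(c) for c in chunks]
    global_max = max(m for m, _ in stats)
    # Rescale each local sum to the global maximum before adding them up.
    global_sum = sum(s * math.exp(m - global_max) for m, s in stats)
    return [[math.exp(x - global_max) / global_sum for x in c] for c in chunks]


# Hypothetical logits split across two workers.
result = distributed_softmax([[1.0, 2.0], [3.0, 4.0]])
flat = [p for chunk in result for p in chunk]
```

Only two scalars per chunk (the local maximum and the local sum) need to be exchanged between workers, which is why this decomposition is a natural fit for many small cores.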

Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations

Giorgos Armeniakos, Alexis Maras, Sotirios Xydis, Dimitrios Soudris

Recent advancements in quantization and mixed-precision approaches offer substantial opportunities to improve the speed and energy efficiency of Neural Networks (NN). Research has shown that individual parameters with varying low precision can attain accuracies comparable to full-precision counterparts. However, modern embedded microprocessors provide very limited support for mixed-precision NNs regarding both Instruction Set Architecture (ISA) extensions and their hardware design for efficient execution of mixed-precision operations, introducing several performance bottlenecks due to numerous instructions for data packing and unpacking, under-utilization of arithmetic units, and so on. In this work, we bring together, for the first time, ISA extensions tailored to mixed-precision hardware optimizations, targeting energy-efficient DNN inference on leading RISC-V CPU architectures. To this end, we introduce a hardware-software co-design framework that enables cooperative hardware design, mixed-precision quantization, ISA extensions and inference in cycle-accurate emulations. At the hardware level, we first expand the ALU within our proof-of-concept micro-architecture to support configurable fine-grained mixed-precision arithmetic operations. Subsequently, we implement multi-pumping to minimize execution latency, with an additional soft SIMD optimization applied for 2-bit operations. At the ISA level, three distinct MAC instructions are encoded extending the RISC-V ISA, and exposed up to the compiler level, each corresponding to a different mixed-precision operational mode. Our extensive experimental evaluation over widely used DNNs and datasets, such as CIFAR10 and ImageNet, demonstrates that our framework can achieve, on average, 15x energy reduction for less than 1% accuracy loss and outperforms the ISA-agnostic state-of-the-art RISC-V cores.

8/14/2024
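The packing and unpacking overhead named in the abstract above can be illustrated in software. The sketch below packs several 2-bit values into one integer word and extracts them again; it is a generic bit-manipulation example with hypothetical function names, not the authors' ISA extension or hardware design.

```python
def pack_2bit(values):
    """Pack unsigned 2-bit values (0..3) into a single integer word."""
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v <= 3:
            raise ValueError("each value must fit in 2 bits")
        # Value i occupies bits 2*i and 2*i + 1 of the word.
        word |= v << (2 * i)
    return word


def unpack_2bit(word, count):
    """Extract `count` 2-bit values from a packed word, lowest bits first."""
    return [(word >> (2 * i)) & 0b11 for i in range(count)]


packed = pack_2bit([3, 0, 2, 1])
unpacked = unpack_2bit(packed, 4)
```

On a core without dedicated support, every shift, mask, and OR here becomes a separate instruction, which suggests why moving such operations into the ISA can reduce instruction count.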

LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale

Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park

Recently, there has been an extensive research effort in building efficient large language model (LLM) inference serving systems. These efforts not only include innovations in the algorithm and software domains but also include the development of various hardware acceleration techniques. Nevertheless, there is a lack of simulation infrastructure capable of accurately modeling versatile hardware-software behaviors in LLM serving systems without extensively extending the simulation time. This paper aims to develop an effective simulation tool, called LLMServingSim, to support future research in LLM serving systems. In designing LLMServingSim, we focus on two limitations of existing simulators: (1) they lack consideration of the dynamic workload variations of LLM inference serving due to its autoregressive nature, and (2) they incur repetitive simulations without leveraging algorithmic redundancies in LLMs. To address these limitations, LLMServingSim simulates LLM serving at the granularity of iterations, leveraging the computation redundancies across decoder blocks and reusing the simulation results from previous iterations. Additionally, LLMServingSim provides a flexible framework that allows users to plug in any accelerator compiler-and-simulation stacks for exploring various system designs with heterogeneous processors. Our experiments demonstrate that LLMServingSim produces simulation results closely following the performance behaviors of a real GPU-based LLM serving system with an error rate below 14.7%, while offering 91.5x faster simulation speed compared to existing accelerator simulators.

8/13/2024
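The idea of reusing simulation results across identical decoder blocks, as described in the abstract above, resembles memoization. The following is a minimal sketch under that assumption: `simulate_block` is a hypothetical stand-in with a placeholder cost model, and none of these names come from LLMServingSim itself.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def simulate_block(batch_size, seq_len):
    """Stand-in for an expensive decoder-block simulation.

    The returned latency is a placeholder in arbitrary units,
    not a model of any real accelerator.
    """
    return batch_size * seq_len * 2


def simulate_iteration(num_blocks, batch_size, seq_len):
    """Sum block latencies; identical blocks share one cached result."""
    return sum(simulate_block(batch_size, seq_len) for _ in range(num_blocks))


total = simulate_iteration(32, 4, 128)
# Number of times the block simulation actually ran (cache misses).
misses = simulate_block.cache_info().misses
```

With 32 identical blocks, the stand-in simulation runs once and the remaining 31 lookups are served from the cache, which is the kind of saving the abstract attributes to exploiting redundancy across decoder blocks.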