Vectorization of Gradient Boosting of Decision Trees Prediction in the CatBoost Library for RISC-V Processors

Read original: arXiv:2405.11062 - Published 5/21/2024 by Evgeny Kozinov, Evgeny Vasiliev, Andrey Gorshkov, Valentina Kustikova, Artem Maklaev, Valentin Volokitin, Iosif Meyerov


Overview

The paper explores the optimization of the CatBoost library, a widely used gradient boosting library, for RISC-V CPUs with the RVV 0.7.1 vector extension.
It highlights the need for manual vectorization of computationally intensive loops to effectively utilize the resources of RISC-V CPUs, as C++ compilers cannot do this automatically yet.
The paper presents the authors' experience in benchmarking CatBoost on a RISC-V-based board and shows that manual vectorization can speed up prediction with decision trees several times, depending on the workload.

Plain English Explanation

The paper focuses on optimizing CatBoost, a popular machine learning library that implements gradient boosting on decision trees, for RISC-V processors. RISC-V is an open instruction set architecture that is gaining traction in various applications, from low-power IoT devices to high-performance servers.

To take full advantage of RISC-V CPUs, the software needs to be optimized for the specific hardware. In the case of CatBoost, which is designed for commodity CPUs and GPUs, the authors found that manual vectorization of the computationally intensive parts of the code was required to effectively utilize the RVV 0.7.1 vector extension on RISC-V.

Vectorization is a technique that lets the processor apply one instruction to several data elements at once, similar to how a person can carry multiple items in one trip. By manually implementing this optimization, the authors sped up prediction with decision-tree-based models several times, depending on the workload, on a RISC-V-based board called the Lichee Pi 4a.
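To make the idea concrete, here is a minimal sketch in portable C++, not taken from the paper or from CatBoost. It contrasts a scalar loop with a blocked ("strip-mined") loop whose inner body is the part a single vector instruction would replace. The function names and the lane count `kLanes` are hypothetical; on real RVV hardware the number of elements per step is set at run time rather than fixed at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical lane count, chosen only for illustration.
constexpr std::size_t kLanes = 4;

// Scalar baseline: one comparison per loop iteration.
std::vector<std::uint8_t> CompareScalar(const std::vector<float>& values, float threshold) {
    std::vector<std::uint8_t> mask(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        mask[i] = values[i] > threshold ? 1 : 0;
    }
    return mask;
}

// Blocked version: the outer loop advances by a whole block of elements.
// The inner loop over lanes stands in for one vector compare instruction.
std::vector<std::uint8_t> CompareBlocked(const std::vector<float>& values, float threshold) {
    std::vector<std::uint8_t> mask(values.size());
    std::size_t i = 0;
    while (i < values.size()) {
        const std::size_t remaining = values.size() - i;
        // The last block may be shorter than kLanes (the "tail").
        const std::size_t vl = remaining < kLanes ? remaining : kLanes;
        assert(vl > 0);
        for (std::size_t lane = 0; lane < vl; ++lane) {
            mask[i + lane] = values[i + lane] > threshold ? 1 : 0;
        }
        i += vl;
    }
    return mask;
}
```

Both functions return the same mask. The difference is only in how the work is grouped, which is what manual vectorization with intrinsics makes explicit.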

The paper shares the authors' experiences and the publicly available code they developed, which can be useful for others working on optimizing machine learning libraries for RISC-V platforms.

Technical Explanation

The paper focuses on the optimization of the CatBoost library, one of the widely used implementations of gradient boosting for decision trees, for RISC-V CPUs with the RVV 0.7.1 vector extension.

CatBoost is deeply optimized for commodity CPUs and GPUs, but the authors found that manual vectorization of the computationally intensive loops was required to effectively utilize RISC-V CPUs. The reason is that C++ compilers cannot yet vectorize this code automatically for the RVV 0.7.1 vector extension.

The authors benchmarked CatBoost on the Lichee Pi 4a, a RISC-V-based board, and found that manual vectorization with intrinsics (compiler-provided functions that map directly to vector instructions) can speed up the use of decision trees several times, depending on the specific workload. The developed code is publicly available on GitHub as a reference for others working on similar optimizations.
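As an illustration of the kind of loop involved, the sketch below evaluates one oblivious (symmetric) decision tree, the tree type CatBoost builds by default, for a block of documents. It is written in portable C++ rather than RVV intrinsics and is not the authors' code. The `ObliviousTree` struct, the `PredictTree` function, the feature-major data layout, and the `>=` comparison are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: every level of an oblivious tree tests one feature
// against one border, so a leaf is identified by `depth` comparison bits.
struct ObliviousTree {
    std::vector<std::size_t> featureIndex;  // feature tested at each level
    std::vector<std::uint8_t> border;       // quantized threshold at each level
    std::vector<double> leafValues;         // 2^depth values, indexed by the bits
};

// binsByFeature[f][doc] holds the quantized value of feature f for document doc
// (assumed feature-major so that the inner loop reads memory with unit stride).
std::vector<double> PredictTree(const ObliviousTree& tree,
                                const std::vector<std::vector<std::uint8_t>>& binsByFeature,
                                std::size_t docCount) {
    const std::size_t depth = tree.featureIndex.size();
    assert(tree.border.size() == depth);
    assert(tree.leafValues.size() == (std::size_t{1} << depth));

    std::vector<std::uint32_t> leafIndex(docCount, 0);
    for (std::size_t level = 0; level < depth; ++level) {
        const std::vector<std::uint8_t>& bins = binsByFeature[tree.featureIndex[level]];
        const std::uint8_t border = tree.border[level];
        // Branch-free loop over documents: compare, shift, OR.
        // This is the shape of loop that vector intrinsics can process in blocks.
        for (std::size_t doc = 0; doc < docCount; ++doc) {
            leafIndex[doc] |= static_cast<std::uint32_t>(bins[doc] >= border) << level;
        }
    }

    // Look up the leaf value for each document.
    std::vector<double> result(docCount);
    for (std::size_t doc = 0; doc < docCount; ++doc) {
        result[doc] = tree.leafValues[leafIndex[doc]];
    }
    return result;
}
```

Because every document at a given level is compared against the same border, the inner loop has no data-dependent branches. That regularity is what makes it a natural candidate for vector compare and bitwise instructions. A full model would sum the results over all trees in the ensemble.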

Critical Analysis

The paper provides a practical example of the challenges and opportunities in optimizing machine learning libraries for emerging architectures like RISC-V. The authors acknowledge the limitations of current C++ compilers in automatically vectorizing code for RISC-V, which highlights the need for further research and development in this area.

While the paper demonstrates the effectiveness of manual vectorization, it would be interesting to see how the performance of the optimized CatBoost library compares to other RISC-V-specific implementations or alternative machine learning frameworks. Additionally, the authors could have explored the trade-offs between the performance gains and the increased complexity of manual optimization, as well as the potential for automation or domain-specific compiler techniques to simplify this process in the future.

Conclusion

This paper showcases the importance of hardware-specific optimizations to effectively utilize emerging architectures like RISC-V. The authors' work on optimizing the CatBoost library for RISC-V CPUs with manual vectorization demonstrates the potential for significant performance improvements in machine learning applications on these platforms.

The publicly available code and the insights provided in the paper can be valuable for researchers and developers working on optimizing software for RISC-V or exploring the use of gradient boosting techniques on specialized hardware. As the RISC-V ecosystem continues to grow, these types of optimization efforts will play a crucial role in unlocking the full potential of this open-source instruction set architecture.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Vectorization of Gradient Boosting of Decision Trees Prediction in the CatBoost Library for RISC-V Processors

Evgeny Kozinov, Evgeny Vasiliev, Andrey Gorshkov, Valentina Kustikova, Artem Maklaev, Valentin Volokitin, Iosif Meyerov

The emergence and rapid development of the open RISC-V instruction set architecture opens up new horizons on the way to efficient devices, ranging from existing low-power IoT boards to future high-performance servers. The effective use of RISC-V CPUs requires software optimization for the target platform. In this paper, we focus on the RISC-V-specific optimization of the CatBoost library, one of the widely used implementations of gradient boosting for decision trees. The CatBoost library is deeply optimized for commodity CPUs and GPUs. However, vectorization is required to effectively utilize the resources of RISC-V CPUs with the RVV 0.7.1 vector extension, which cannot be done automatically with a C++ compiler yet. The paper reports on our experience in benchmarking CatBoost on the Lichee Pi 4a, RISC-V-based board, and shows how manual vectorization of computationally intensive loops with intrinsics can speed up the use of decision trees several times, depending on the specific workload. The developed codes are publicly available on GitHub.

5/21/2024

RISC-V RVV efficiency for ANN algorithms

Konstantin Rumyantsev, Pavel Yakovlev, Andrey Gorshkov, Andrey P. Sokolov

Handling vast amounts of data is crucial in today's world. The growth of high-performance computing has created a need for parallelization, particularly in the area of machine learning algorithms such as ANN (Approximate Nearest Neighbors). To improve the speed of these algorithms, it is important to optimize them for specific processor architectures. RISC-V (Reduced Instruction Set Computer Five) is one of the modern processor architectures, which features a vector instruction set called RVV (RISC-V Vector Extension). In machine learning algorithms, vector extensions are widely utilized to improve the processing of voluminous data. This study examines the effectiveness of applying RVV to commonly used ANN algorithms. The algorithms were adapted for RISC-V and optimized using RVV after identifying the primary bottlenecks. Additionally, we developed a theoretical model of a parameterized vector block and identified the best on average configuration that demonstrates the highest theoretical performance of the studied ANN algorithms when the other CPU parameters are fixed.

7/19/2024

Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

Viviane Potocnik, Luca Colagrande, Tim Fischer, Luca Bertaccini, Daniele Jahier Pagliari, Alessio Burrello, Luca Benini

Transformer-based foundation models have become crucial for various domains, most notably natural language processing (NLP) or computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform implementing distributed Softmax primitives and leveraging ISA extensions for SIMD floating-point operand streaming and instruction repetition, as well as specialized DMA engines to minimize costly main memory accesses and to tolerate their latency. We focus on two foundational transformer topologies, encoder-only and decoder-only models. For encoder-only models, we demonstrate a speedup of up to 12.8x between the most optimized implementation and the baseline version. We reach over 79% FPU utilization and 294 GFLOPS/W, outperforming State-of-the-Art (SoA) accelerators by more than 2x utilizing the HW platform while achieving comparable throughput per computational unit. For decoder-only topologies, we achieve 16.1x speedup in the Non-Autoregressive (NAR) mode and up to 35.6x speedup in the Autoregressive (AR) mode compared to the baseline implementation. Compared to the best SoA dedicated accelerator, we achieve 2.04x higher FPU utilization.

5/30/2024

Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations

Giorgos Armeniakos, Alexis Maras, Sotirios Xydis, Dimitrios Soudris

Recent advancements in quantization and mixed-precision approaches offer substantial opportunities to improve the speed and energy efficiency of Neural Networks (NN). Research has shown that individual parameters with varying low precision can attain accuracies comparable to full-precision counterparts. However, modern embedded microprocessors provide very limited support for mixed-precision NNs regarding both Instruction Set Architecture (ISA) extensions and their hardware design for efficient execution of mixed-precision operations, i.e., introducing several performance bottlenecks due to numerous instructions for data packing and unpacking, arithmetic unit under-utilizations etc. In this work, we bring together, for the first time, ISA extensions tailored to mixed-precision hardware optimizations, targeting energy-efficient DNN inference on leading RISC-V CPU architectures. To this end, we introduce a hardware-software co-design framework that enables cooperative hardware design, mixed-precision quantization, ISA extensions and inference in cycle-accurate emulations. At hardware level, we firstly expand the ALU unit within our proof-of-concept micro-architecture to support configurable fine grained mixed-precision arithmetic operations. Subsequently, we implement multi-pumping to minimize execution latency, with an additional soft SIMD optimization applied for 2-bit operations. At the ISA level, three distinct MAC instructions are encoded extending the RISC-V ISA, and exposed up to the compiler level, each corresponding to a different mixed-precision operational mode. Our extensive experimental evaluation over widely used DNNs and datasets, such as CIFAR10 and ImageNet, demonstrates that our framework can achieve, on average, 15x energy reduction for less than 1% accuracy loss and outperforms the ISA-agnostic state-of-the-art RISC-V cores.

8/14/2024