Automated Deep Neural Network Inference Partitioning for Distributed Embedded Systems

2406.19913

Published 7/1/2024 by Fabian Kress, El Mahdi El Annabi, Tim Hotfilter, Julian Hoefer, Tanja Harbaum, Juergen Becker

🤿

Abstract

Distributed systems can be found in various applications, e.g., in robotics or autonomous driving, to achieve higher flexibility and robustness. Thereby, data flow centric applications such as Deep Neural Network (DNN) inference benefit from partitioning the workload over multiple compute nodes in terms of performance and energy-efficiency. However, mapping large models on distributed embedded systems is a complex task, due to low latency and high throughput requirements combined with strict energy and memory constraints. In this paper, we present a novel approach for hardware-aware layer scheduling of DNN inference in distributed embedded systems. Therefore, our proposed framework uses a graph-based algorithm to automatically find beneficial partitioning points in a given DNN. Each of these is evaluated based on several essential system metrics such as accuracy and memory utilization, while considering the respective system constraints. We demonstrate our approach in terms of the impact of inference partitioning on various performance metrics of six different DNNs. As an example, we can achieve a 47.5 % throughput increase for EfficientNet-B0 inference partitioned onto two platforms while observing high energy-efficiency.

Create account to get full access

Overview

Distributed systems are used in various applications like robotics and autonomous driving to achieve higher flexibility and robustness.
Deep Neural Network (DNN) inference can benefit from partitioning the workload across multiple compute nodes in terms of performance and energy-efficiency.
Mapping large DNN models on distributed embedded systems is challenging due to low latency, high throughput, and strict energy/memory constraints.
This paper presents a novel approach for hardware-aware layer scheduling of DNN inference in distributed embedded systems.

Plain English Explanation

Distributed systems, where tasks are spread across multiple computers, are used in applications like robotics and autonomous driving to make them more flexible and reliable. Deep Neural Networks (DNNs), which are a type of advanced AI model, can also benefit from being distributed across multiple computers. This allows the computationally intensive DNN inference process to be split up and run in parallel, improving performance and efficiency.

However, deploying large DNN models on distributed embedded systems (small, specialized computers) is very complex. These embedded systems have strict requirements for low latency (quick response time), high throughput (ability to process lots of data), and limited energy and memory. The paper presents a new approach to automatically divide up the work of running a DNN across multiple embedded computers in a way that meets all these constraints.

Technical Explanation

The paper's proposed framework uses a graph-based algorithm to automatically identify beneficial points to partition a given DNN model for distributed inference. Each potential partitioning point is evaluated based on several key system metrics, such as accuracy, memory utilization, and others, while considering the specific constraints of the target embedded hardware.

The authors demonstrate their approach by analyzing the impact of inference partitioning on various performance metrics for six different DNN models. For example, they were able to achieve a 47.5% throughput increase for the EfficientNet-B0 model by partitioning it across two embedded platforms, while maintaining high energy-efficiency.

Critical Analysis

The paper provides a comprehensive and systematic approach to the challenge of deploying large DNN models on distributed embedded systems. By considering hardware-aware factors, the framework aims to find optimal partitioning points that balance performance, efficiency, and other critical constraints.

However, the evaluation is limited to a fairly narrow set of DNN models and hardware configurations. Further research may be needed to understand how well the approach generalizes to a broader range of models and platforms, especially as DNN architectures and embedded hardware continue to evolve rapidly.

Additionally, the paper does not address the potential complexities of dynamic workloads or the need to reconfigure partitions at runtime. Extending the framework to handle more dynamic deployment scenarios could be an important area for future work.

Conclusion

This research presents a novel solution for mapping large DNN models to distributed embedded systems in a way that optimizes for key performance and efficiency metrics. By automatically identifying beneficial partitioning points, the approach aims to enable the deployment of advanced AI capabilities on resource-constrained edge devices, with potential applications in robotics, autonomous driving, and other domains that require flexible, high-performance, and energy-efficient distributed inference.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🤯

Embedded Distributed Inference of Deep Neural Networks: A Systematic Review

Federico Nicol'as Peccia, Oliver Bringmann

Embedded distributed inference of Neural Networks has emerged as a promising approach for deploying machine-learning models on resource-constrained devices in an efficient and scalable manner. The inference task is distributed across a network of embedded devices, with each device contributing to the overall computation by performing a portion of the workload. In some cases, more powerful devices such as edge or cloud servers can be part of the system to be responsible of the most demanding layers of the network. As the demand for intelligent systems and the complexity of the deployed neural network models increases, this approach is becoming more relevant in a variety of applications such as robotics, autonomous vehicles, smart cities, Industry 4.0 and smart health. We present a systematic review of papers published during the last six years which describe techniques and methods to distribute Neural Networks across these kind of systems. We provide an overview of the current state-of-the-art by analysing more than 100 papers, present a new taxonomy to characterize them, and discuss trends and challenges in the field.

5/7/2024

cs.DC

Resource-aware Deployment of Dynamic DNNs over Multi-tiered Interconnected Systems

Chetna Singhal, Yashuo Wu, Francesco Malandrino, Marco Levorato, Carla Fabiana Chiasserini

The increasing pervasiveness of intelligent mobile applications requires to exploit the full range of resources offered by the mobile-edge-cloud network for the execution of inference tasks. However, due to the heterogeneity of such multi-tiered networks, it is essential to make the applications' demand amenable to the available resources while minimizing energy consumption. Modern dynamic deep neural networks (DNN) achieve this goal by designing multi-branched architectures where early exits enable sample-based adaptation of the model depth. In this paper, we tackle the problem of allocating sections of DNNs with early exits to the nodes of the mobile-edge-cloud system. By envisioning a 3-stage graph-modeling approach, we represent the possible options for splitting the DNN and deploying the DNN blocks on the multi-tiered network, embedding both the system constraints and the application requirements in a convenient and efficient way. Our framework -- named Feasible Inference Graph (FIN) -- can identify the solution that minimizes the overall inference energy consumption while enabling distributed inference over the multi-tiered network with the target quality and latency. Our results, obtained for DNNs with different levels of complexity, show that FIN matches the optimum and yields over 65% energy savings relative to a state-of-the-art technique for cost minimization.

4/15/2024

cs.NI eess.SP

Hybrid-Parallel: Achieving High Performance and Energy Efficient Distributed Inference on Robots

Zekai Sun, Xiuxian Guan, Junming Wang, Haoze Song, Yuhao Qing, Tianxiang Shen, Dong Huang, Fangming Liu, Heming Cui

The rapid advancements in machine learning techniques have led to significant achievements in various real-world robotic tasks. These tasks heavily rely on fast and energy-efficient inference of deep neural network (DNN) models when deployed on robots. To enhance inference performance, distributed inference has emerged as a promising approach, parallelizing inference across multiple powerful GPU devices in modern data centers using techniques such as data parallelism, tensor parallelism, and pipeline parallelism. However, when deployed on real-world robots, existing parallel methods fail to provide low inference latency and meet the energy requirements due to the limited bandwidth of robotic IoT. We present Hybrid-Parallel, a high-performance distributed inference system optimized for robotic IoT. Hybrid-Parallel employs a fine-grained approach to parallelize inference at the granularity of local operators within DNN layers (i.e., operators that can be computed independently with the partial input, such as the convolution kernel in the convolution layer). By doing so, Hybrid-Parallel enables different operators of different layers to be computed and transmitted concurrently, and overlap the computation and transmission phases within the same inference task. The evaluation demonstrate that Hybrid-Parallel reduces inference time by 14.9% ~41.1% and energy consumption per inference by up to 35.3% compared to the state-of-the-art baselines.

5/30/2024

cs.RO cs.DC

A Survey of Distributed Learning in Cloud, Mobile, and Edge Settings

Madison Threadgill, Andreas Gerstlauer

In the era of deep learning (DL), convolutional neural networks (CNNs), and large language models (LLMs), machine learning (ML) models are becoming increasingly complex, demanding significant computational resources for both inference and training stages. To address this challenge, distributed learning has emerged as a crucial approach, employing parallelization across various devices and environments. This survey explores the landscape of distributed learning, encompassing cloud and edge settings. We delve into the core concepts of data and model parallelism, examining how models are partitioned across different dimensions and layers to optimize resource utilization and performance. We analyze various partitioning schemes for different layer types, including fully connected, convolutional, and recurrent layers, highlighting the trade-offs between computational efficiency, communication overhead, and memory constraints. This survey provides valuable insights for future research and development in this rapidly evolving field by comparing and contrasting distributed learning approaches across diverse contexts.

5/27/2024

cs.LG