A structure-aware framework for learning device placements on computation graphs

Read original: arXiv:2405.14185 - Published 5/24/2024 by Shukai Duan, Heng Ping, Nikos Kanakaris, Xiongye Xiao, Peiyu Zhang, Panagiotis Kyriakis, Nesreen K. Ahmed, Guixiang Ma, Mihai Capota, Shahin Nazarian, Theodore L. Willke, Paul Bogdan

Overview

Existing approaches for device placement ignore the topological features of computation graphs and rely mostly on heuristic methods for graph partitioning.
They either follow a grouper-placer or an encoder-placer architecture, which requires understanding the interaction structure between code operations.
This paper proposes a novel framework for device placement that uses reinforcement learning and operates on smaller computation graphs extracted from the OpenVINO toolkit.
The framework consists of five steps, including graph coarsening, node representation learning, and policy optimization.
It facilitates end-to-end training and considers the directed and acyclic nature of the computation graphs.
The paper also proposes a model variant inspired by graph parsing networks and complex network analysis, enabling joint graph representation learning and personalized graph partitioning.

Plain English Explanation

Device placement means deciding which hardware device (for example a CPU or a GPU) runs each operation of a neural network. Existing approaches often ignore the structure and connections within the computation graph and rely on heuristic methods for dividing it. They follow either a "grouper-placer" or an "encoder-placer" architecture, which requires understanding the relationships between operations in the code.

To address this, the researchers propose a framework that uses reinforcement learning on smaller computation graphs extracted from the OpenVINO toolkit to find efficient device placements. The framework has five steps, including simplifying the graph (graph coarsening), learning a numerical representation of each node (node representation learning), and optimizing the placement policy. It is trained end to end and takes into account the directed and acyclic nature of computation graphs, meaning that data flows one way and never loops back.
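To make the coarsening idea concrete, here is a minimal Python sketch. It is not the paper's algorithm, which this summary does not specify. It assumes one simple rule, merging linear chains of operations, chosen because it keeps the coarsened graph directed and acyclic. The function name `coarsen_chains` and the toy operation names are hypothetical.

```python
from collections import defaultdict


def coarsen_chains(edges):
    """Merge u -> v whenever v is u's only successor and u is v's only predecessor.

    Assumed rule for illustration only. Contracting such edges cannot create
    a cycle, so the coarse graph stays a DAG.
    """
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
        nodes.update((u, v))

    # Union-find: maps every node to the representative of its group.
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u in sorted(nodes):
        if len(succ[u]) == 1:
            (v,) = succ[u]
            if len(pred[v]) == 1:
                parent[find(v)] = find(u)

    groups = defaultdict(list)
    for n in sorted(nodes):
        groups[find(n)].append(n)
    # Keep only edges that cross group boundaries.
    coarse_edges = {(find(u), find(v)) for u, v in edges if find(u) != find(v)}
    return dict(groups), coarse_edges


# Toy computation graph with one branch (hypothetical operation names).
toy_edges = [
    ("conv1", "relu1"), ("relu1", "pool"),
    ("pool", "branch_a"), ("pool", "branch_b"),
    ("branch_a", "concat"), ("branch_b", "concat"),
    ("concat", "fc"),
]
groups, coarse_edges = coarsen_chains(toy_edges)
for rep, members in groups.items():
    print(rep, "<-", members)
```

In this toy graph the chain before the branch and the chain after the join each collapse into one node, while the two parallel branches stay separate, so a placer has fewer decisions to make without losing the branching structure.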

The paper also introduces a model variant inspired by graph parsing networks and complex network analysis. This variant can simultaneously learn how to represent the graph and partition it into an unspecified number of groups, tailoring the placement to the specific computational graph.
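One way to picture partitioning into a number of groups that is not fixed in advance is sketched below in Python. It is loosely in the spirit of graph parsing and is not the paper's implementation. It assumes each edge already has a score (in a learned model these would come from the node representations), lets every node keep only its highest-scoring incident edge, and treats the connected components of the kept edges as groups. The function name `parse_groups` and the scores are hypothetical.

```python
def parse_groups(num_nodes, scored_edges):
    """Group nodes by the connected components of each node's strongest edge.

    scored_edges is a list of (u, v, score) tuples. The number of groups is
    an outcome of the scores, not an input.
    """
    # For every node, remember the highest-scoring edge that touches it.
    best = {}
    for u, v, score in scored_edges:
        for n in (u, v):
            if n not in best or score > best[n][0]:
                best[n] = (score, u, v)

    parent = list(range(num_nodes))

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    # Union the endpoints of every kept edge.
    for _, u, v in best.values():
        parent[find(u)] = find(v)

    groups = {}
    for n in range(num_nodes):
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Hypothetical edge scores on a 6-node chain: strong, weak, strong, weak, strong.
scored = [(0, 1, 0.9), (1, 2, 0.2), (2, 3, 0.8), (3, 4, 0.1), (4, 5, 0.7)]
partition = parse_groups(6, scored)
print(partition)
```

Changing the scores changes how many groups come out, which is the sense in which the partition is tailored to each graph.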

Technical Explanation

The key aspects of this paper's technical approach are:

Graph Coarsening: The researchers work with computation graphs extracted from the OpenVINO toolkit and apply graph coarsening to simplify them. This makes the problem more manageable for the reinforcement learning-based optimization.
Node Representation Learning: Next, they learn numerical representations for each node in the simplified computation graphs. This allows the model to capture the important features and relationships within the graph structure.
Policy Optimization: The core of the framework is a reinforcement learning-based optimization process that learns a device placement policy. The reward is formulated from the execution time of the suggested placements, so training pushes the policy toward faster execution.
Graph Parsing Variant: The paper also introduces a model variant that combines graph representation learning and personalized graph partitioning. This enables the framework to adaptively partition the computation graph into an unspecified number of groups, tailoring the placement to the specific graph structure.
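The policy optimization step can be illustrated with a small REINFORCE-style sketch in Python. It is only a sketch under stated assumptions: the per-group costs, the transfer penalty, and the `simulated_time` cost model are made up for illustration, whereas the paper measures the real execution time of each suggested placement. The policy here is a plain table of logits per group rather than a learned graph model.

```python
import math
import random

DEVICES = ["CPU", "GPU"]
# Hypothetical compute cost (ms) of each coarsened group on each device.
COMPUTE_MS = [
    {"CPU": 4.0, "GPU": 1.0},
    {"CPU": 1.0, "GPU": 3.0},
    {"CPU": 6.0, "GPU": 2.0},
]
EDGES = [(0, 1), (1, 2)]   # data dependencies between groups
TRANSFER_MS = 0.5          # hypothetical cost of crossing devices


def simulated_time(placement):
    """Stand-in for measuring execution time of a placement (list of device names)."""
    compute = sum(COMPUTE_MS[g][d] for g, d in enumerate(placement))
    transfer = sum(TRANSFER_MS for u, v in EDGES if placement[u] != placement[v])
    return compute + transfer


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    """REINFORCE with a moving-average baseline; reward = -execution time."""
    rng = random.Random(seed)
    logits = [[0.0] * len(DEVICES) for _ in COMPUTE_MS]
    baseline = None
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        actions = [rng.choices(range(len(DEVICES)), weights=p)[0] for p in probs]
        reward = -simulated_time([DEVICES[a] for a in actions])
        advantage = reward - baseline if baseline is not None else 0.0
        baseline = reward if baseline is None else 0.9 * baseline + 0.1 * reward
        for g, a in enumerate(actions):
            for d in range(len(DEVICES)):
                # Gradient of log softmax w.r.t. logit d for sampled action a.
                grad = (1.0 if d == a else 0.0) - probs[g][d]
                logits[g][d] += lr * advantage * grad
    return logits


logits = train()
placement = [DEVICES[max(range(len(DEVICES)), key=row.__getitem__)] for row in logits]
print(placement, simulated_time(placement))
```

Placements that run faster than the recent average get their probabilities raised, and slower ones get them lowered. Because sampling is random, the placement found in a given run is not guaranteed to be the best one.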

The researchers evaluate their approach on three benchmark models: Inception-V3, ResNet, and BERT. The suggested placements improve inference speed by up to 58.2% over CPU-only execution and by up to 60.24% over other commonly used baselines. An ablation study is used to examine the robustness of the framework.
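This summary does not say how the percentage improvement is computed, so the short Python sketch below shows two common readings side by side. Both formulas and the 100 ms baseline are assumptions for illustration, not figures or definitions from the paper.

```python
def time_if_reduction(baseline_ms, pct):
    """Reading 1: execution time drops by pct percent."""
    return baseline_ms * (1.0 - pct / 100.0)


def time_if_throughput_gain(baseline_ms, pct):
    """Reading 2: inferences per second rise by pct percent."""
    return baseline_ms / (1.0 + pct / 100.0)


baseline_ms = 100.0  # hypothetical CPU-only latency, not from the paper
reduced = time_if_reduction(baseline_ms, 58.2)
gained = time_if_throughput_gain(baseline_ms, 58.2)
print(f"reading 1: {reduced:.1f} ms, reading 2: {gained:.1f} ms")
```

The two readings give noticeably different latencies for the same headline number, so the original paper should be consulted for the exact definition.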

Critical Analysis

The researchers have proposed an innovative approach that addresses the limitations of existing device placement techniques. By leveraging reinforcement learning and considering the topological structure of computation graphs, the framework can generate more efficient placements.

However, the paper does not provide a detailed discussion of the limitations or potential issues with the proposed methods. For example, the computational complexity of the graph coarsening and node representation learning steps could be a concern, especially for larger and more complex graphs. Additionally, the reliance on the OpenVINO toolkit may limit the generalizability of the approach to other frameworks or custom computational graphs.

Further research could explore the scalability of the framework, its performance on a wider range of models and hardware configurations, and the potential trade-offs between placement quality and computational efficiency. Incorporating additional constraints, such as power consumption or memory usage, could also enhance the practical applicability of the framework.

Conclusion

This paper presents a novel device placement framework that addresses the shortcomings of existing approaches by leveraging reinforcement learning and incorporating the topological features of computation graphs. The proposed solution demonstrates significant improvements in inference speed across various benchmark models, showcasing the potential of this data-driven approach to optimizing hardware utilization.

The paper's innovative techniques, such as the graph coarsening, node representation learning, and the graph parsing variant, provide a solid foundation for further advancements in the field of automated hardware-software co-design and constrained object placement. As the complexity of computational workloads continues to grow, this research highlights the importance of considering the structural properties of computation graphs in developing efficient device placement strategies.

Related Papers

A structure-aware framework for learning device placements on computation graphs

Shukai Duan, Heng Ping, Nikos Kanakaris, Xiongye Xiao, Peiyu Zhang, Panagiotis Kyriakis, Nesreen K. Ahmed, Guixiang Ma, Mihai Capota, Shahin Nazarian, Theodore L. Willke, Paul Bogdan

Existing approaches for device placement ignore the topological features of computation graphs and rely mostly on heuristic methods for graph partitioning. At the same time, they either follow a grouper-placer or an encoder-placer architecture, which requires understanding the interaction structure between code operations. To bridge the gap between encoder-placer and grouper-placer techniques, we propose a novel framework for the task of device placement, relying on smaller computation graphs extracted from the OpenVINO toolkit using reinforcement learning. The framework consists of five steps, including graph coarsening, node representation learning and policy optimization. It facilitates end-to-end training and takes into consideration the directed and acyclic nature of the computation graphs. We also propose a model variant, inspired by graph parsing networks and complex network analysis, enabling graph representation learning and personalized graph partitioning jointly, using an unspecified number of groups. To train the entire framework, we utilize reinforcement learning techniques by employing the execution time of the suggested device placements to formulate the reward. We demonstrate the flexibility and effectiveness of our approach through multiple experiments with three benchmark models, namely Inception-V3, ResNet, and BERT. The robustness of the proposed framework is also highlighted through an ablation study. The suggested placements improve the inference speed for the benchmark models by up to 58.2% over CPU execution and by up to 60.24% compared to other commonly used baselines.

5/24/2024

Graph Neural Networks and Reinforcement Learning for Proactive Application Image Placement

Antonios Makris, Theodoros Theodoropoulos, Evangelos Psomakelis, Emanuele Carlini, Matteo Mordacchini, Patrizio Dazzi, Konstantinos Tserpes

The shift from Cloud Computing to a Cloud-Edge continuum presents new opportunities and challenges for data-intensive and interactive applications. Edge computing has garnered a lot of attention from both industry and academia in recent years, emerging as a key enabler for meeting the increasingly strict demands of Next Generation applications. In Edge computing the computations are placed closer to the end-users, to facilitate low-latency and high-bandwidth applications and services. However, the distributed, dynamic, and heterogeneous nature of Edge computing presents a significant challenge for service placement. A critical aspect of Edge computing involves managing the placement of applications within the network system to minimize each application's runtime, considering the resources available on system devices and the capabilities of the system's network. The placement of application images must be proactively planned to minimize image transfer time and meet the strict demands of the applications. In this regard, this paper proposes an approach for proactive image placement that combines Graph Neural Networks and actor-critic Reinforcement Learning, which is evaluated empirically and compared against various solutions. The findings indicate that although the proposed approach may result in longer execution times in certain scenarios, it consistently achieves superior outcomes in terms of application placement.

7/2/2024

DG-RePlAce: A Dataflow-Driven GPU-Accelerated Analytical Global Placement Framework for Machine Learning Accelerators

Andrew B. Kahng, Zhiang Wang

Global placement is a fundamental step in VLSI physical design. The wide use of 2D processing element (PE) arrays in machine learning accelerators poses new challenges of scalability and Quality of Results (QoR) for state-of-the-art academic global placers. In this work, we develop DG-RePlAce, a new and fast GPU-accelerated global placement framework built on top of the OpenROAD infrastructure, which exploits the inherent dataflow and datapath structures of machine learning accelerators. Experimental results with a variety of machine learning accelerators using a commercial 12nm enablement show that, compared with RePlAce (DREAMPlace), our approach achieves an average reduction in routed wirelength by 10% (7%) and total negative slack (TNS) by 31% (34%), with faster global placement and on-par total runtimes relative to DREAMPlace. Empirical studies on the TILOS MacroPlacement Benchmarks further demonstrate that post-route improvements over RePlAce and DREAMPlace may reach beyond the motivating application to machine learning accelerators.

6/21/2024

Graph Neural Networks Automated Design and Deployment on Device-Edge Co-Inference Systems

Ao Zhou, Jianlei Yang, Tong Qiao, Yingjie Qi, Zhi Yang, Weisheng Zhao, Chunming Hu

The key to device-edge co-inference paradigm is to partition models into computation-friendly and computation-intensive parts across the device and the edge, respectively. However, for Graph Neural Networks (GNNs), we find that simply partitioning without altering their structures can hardly achieve the full potential of the co-inference paradigm due to various computational-communication overheads of GNN operations over heterogeneous devices. We present GCoDE, the first automatic framework for GNN that innovatively Co-designs the architecture search and the mapping of each operation on Device-Edge hierarchies. GCoDE abstracts the device communication process into an explicit operation and fuses the search of architecture and the operations mapping in a unified space for joint-optimization. Also, the performance-awareness approach, utilized in the constraint-based search process of GCoDE, enables effective evaluation of architecture efficiency in diverse heterogeneous systems. We implement the co-inference engine and runtime dispatcher in GCoDE to enhance the deployment efficiency. Experimental results show that GCoDE can achieve up to 44.9× speedup and 98.2% energy reduction compared to existing approaches across various applications and system configurations.

4/9/2024