GraNNDis: Efficient Unified Distributed Training Framework for Deep GNNs on Large Clusters

Read original: arXiv:2311.06837 - Published 8/14/2024 by Jaeyong Song, Hongsun Jang, Jaewon Jung, Youngsok Kim, Jinho Lee

Overview

Graph neural networks (GNNs) are a rapidly growing field in deep learning.
Many distributed GNN training frameworks have been proposed to increase training throughput.
However, these frameworks face three key limitations when applied to multi-server clusters.

Plain English Explanation

Distributed GNN training frameworks aim to speed up the training process by spreading the work across multiple servers. However, when used in multi-server clusters, these frameworks encounter three main problems:

Inter-server Communication Bottleneck: Existing frameworks do not account for the gap between slow links that connect servers and fast links inside a server (the "inter-/intra-server bandwidth gap"). As a result, communication between servers becomes the bottleneck.
Redundant Memory Usage and Computation: The scalability of the distributed frameworks is hindered by inefficient use of memory and redundant computations.
Sampling Issues: The standard mini-batch training approach using sampling can introduce unnecessary errors in multi-server environments.

The researchers found that these limitations can be addressed by exploiting the unique characteristics of multi-server clusters.
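To make the bandwidth gap concrete, here is a rough back-of-the-envelope calculation in Python. The bandwidth figures, vertex count, and feature size are illustrative assumptions chosen for this sketch; they are not measurements from the paper.

```python
# All numbers here are illustrative assumptions, not values from the paper.
INTRA_SERVER_GB_PER_S = 300.0  # assumed GPU-to-GPU bandwidth inside one server
INTER_SERVER_GB_PER_S = 12.5   # assumed network bandwidth between servers


def transfer_seconds(num_vertices, feature_dim, bytes_per_value, gb_per_s):
    """Time to send one feature vector per vertex over a link of the given bandwidth."""
    total_bytes = num_vertices * feature_dim * bytes_per_value
    return total_bytes / (gb_per_s * 1e9)


# Hypothetical workload: 1M vertex features, 256 float32 values each.
intra = transfer_seconds(1_000_000, 256, 4, INTRA_SERVER_GB_PER_S)
inter = transfer_seconds(1_000_000, 256, 4, INTER_SERVER_GB_PER_S)

print(f"intra-server: {intra * 1e3:.2f} ms")
print(f"inter-server: {inter * 1e3:.2f} ms")
print(f"inter/intra ratio: {inter / intra:.1f}x")
```

Under these assumed numbers, the same data takes far longer to cross a server boundary than to move within a server, which is why treating all links as equal leaves performance on the table.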

Technical Explanation

The authors propose GraNNDis, a fast distributed GNN training framework for multi-server clusters. GraNNDis addresses the three key limitations of existing distributed GNN frameworks with three techniques, plus a fourth mechanism that implements them:

Flexible Preloading: GraNNDis preloads essential vertex dependencies on a per-server basis to reduce the low-bandwidth inter-server communications.
Cooperative Batching: GraNNDis enables memory-efficient, less redundant mini-batch training by utilizing the high-bandwidth intra-server communications.
Expansion-aware Sampling: GraNNDis introduces a cluster-aware sampling method that samples only the edges that affect the system's speedup. Because intra-server dependencies travel over fast intra-server links, sampling them contributes little to the speedup, so the method targets only dependencies at the server boundary.
One-Hop Graph Masking: GraNNDis introduces a computation and communication structure to realize the above methods in multi-server environments.
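The ideas behind server-wise preloading and boundary-only sampling can be illustrated with a minimal Python sketch. This is not the GraNNDis implementation: the function names (`preload_set`, `boundary_sample`), the edge-list graph representation, and the uniform `keep_prob` sampling rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_in_neighbors(edges):
    """Map each destination vertex to the set of source vertices it aggregates from."""
    in_nbrs = defaultdict(set)
    for src, dst in edges:
        in_nbrs[dst].add(src)
    return in_nbrs


def preload_set(owned, in_nbrs, num_layers):
    """Remote vertices within `num_layers` hops of the vertices one server owns.

    A server that holds these vertices up front would not need to fetch
    them from other servers during training (hypothetical simplification).
    """
    owned = set(owned)
    frontier, seen = set(owned), set(owned)
    for _ in range(num_layers):
        frontier = {u for v in frontier for u in in_nbrs.get(v, ())} - seen
        seen |= frontier
    return seen - owned


def boundary_sample(edges, server_of, keep_prob, rng):
    """Keep every intra-server edge; keep each inter-server edge with probability keep_prob."""
    kept = []
    for src, dst in edges:
        crosses_servers = server_of[src] != server_of[dst]
        if not crosses_servers or rng.random() < keep_prob:
            kept.append((src, dst))
    return kept


# Toy directed ring of six vertices split across two servers (assumed layout).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
server_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
in_nbrs = build_in_neighbors(edges)

one_hop = preload_set([0, 1, 2], in_nbrs, num_layers=1)
two_hop = preload_set([0, 1, 2], in_nbrs, num_layers=2)
sampled = boundary_sample(edges, server_of, keep_prob=0.5, rng=random.Random(0))

print("server 0 preloads (1 layer):", sorted(one_hop))
print("server 0 preloads (2 layers):", sorted(two_hop))
print("edges kept after boundary sampling:", sampled)
```

In this sketch, the set of remote dependencies grows with the number of GNN layers, and sampling touches only edges that cross servers, since those are the ones that travel over the slow links.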

The authors evaluated GraNNDis on multi-server clusters and found that it provided significant speedup over the state-of-the-art distributed GNN training frameworks.

Critical Analysis

The paper addresses important limitations of existing distributed GNN training frameworks when applied to multi-server clusters. The proposed solutions, such as Flexible Preloading, Cooperative Batching, and Expansion-aware Sampling, seem well-designed to tackle the specific challenges of these environments.

However, the paper does not discuss the potential limitations or drawbacks of the GraNNDis framework. For example, it would be helpful to understand the overhead or complexity introduced by the additional mechanisms, and how they might impact the overall training time or resource utilization.

Additionally, the paper could have explored the generalizability of the proposed techniques to other types of distributed deep learning frameworks, beyond just GNNs. This would help establish the broader applicability of the research.

Conclusion

The GraNNDis framework presented in this paper offers a promising approach to address the limitations of distributed GNN training in multi-server clusters. By leveraging the unique characteristics of these environments, GraNNDis can significantly improve the training throughput compared to existing distributed GNN frameworks.

The innovative techniques, such as Flexible Preloading, Cooperative Batching, and Expansion-aware Sampling, demonstrate the researchers' deep understanding of the challenges faced in multi-server distributed training. As GNNs continue to grow in importance, frameworks like GraNNDis could play a vital role in enabling efficient and scalable deployment of these models in real-world, large-scale applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

GraNNDis: Efficient Unified Distributed Training Framework for Deep GNNs on Large Clusters

Jaeyong Song, Hongsun Jang, Jaewon Jung, Youngsok Kim, Jinho Lee

Graph neural networks (GNNs) are one of the rapidly growing fields within deep learning. While many distributed GNN training frameworks have been proposed to increase the training throughput, they face three limitations when applied to multi-server clusters. 1) They suffer from an inter-server communication bottleneck because they do not consider the inter-/intra-server bandwidth gap, a representative characteristic of multi-server clusters. 2) Redundant memory usage and computation hinder the scalability of the distributed frameworks. 3) Sampling methods, de facto standard in mini-batch training, incur unnecessary errors in multi-server clusters. We found that these limitations can be addressed by exploiting the characteristics of multi-server clusters. Here, we propose GraNNDis, a fast distributed GNN training framework for multi-server clusters. Firstly, we present Flexible Preloading, which preloads the essential vertex dependencies server-wise to reduce the low-bandwidth inter-server communications. Secondly, we introduce Cooperative Batching, which enables memory-efficient, less redundant mini-batch training by utilizing high-bandwidth intra-server communications. Thirdly, we propose Expansion-aware Sampling, a cluster-aware sampling method, which samples the edges that affect the system speedup. As sampling the intra-server dependencies does not contribute much to the speedup as they are communicated through fast intra-server links, it only targets a server boundary to be sampled. Lastly, we introduce One-Hop Graph Masking, a computation and communication structure to realize the above methods in multi-server environments. We evaluated GraNNDis on multi-server clusters, and it provided significant speedup over the state-of-the-art distributed GNN training frameworks. GraNNDis is open-sourced at https://github.com/AIS-SNU/GraNNDis_Artifact to facilitate its use.

8/14/2024

GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs

Vipul Gupta, Xin Chen, Ruoyun Huang, Fanlong Meng, Jianjun Chen, Yujun Yan

Graph Neural Networks (GNNs) have emerged as powerful tools for supervised machine learning over graph-structured data, while sampling-based node representation learning is widely utilized in unsupervised learning. However, scalability remains a major challenge in both supervised and unsupervised learning for large graphs (e.g., those with over 1 billion nodes). The scalability bottleneck largely stems from the mini-batch sampling phase in GNNs and the random walk sampling phase in unsupervised methods. These processes often require storing features or embeddings in memory. In the context of distributed training, they require frequent, inefficient random access to data stored across different workers. Such repeated inter-worker communication for each mini-batch leads to high communication overhead and computational inefficiency. We propose GraphScale, a unified framework for both supervised and unsupervised learning to store and process large graph data distributedly. The key insight in our design is the separation of workers who store data and those who perform the training. This separation allows us to decouple computing and storage in graph training, thus effectively building a pipeline where data fetching and data computation can overlap asynchronously. Our experiments show that GraphScale outperforms state-of-the-art methods for distributed training of both GNNs and node embeddings. We evaluate GraphScale both on public and proprietary graph datasets and observe a reduction of at least 40% in end-to-end training times compared to popular distributed frameworks, without any loss in performance. While most existing methods don't support billion-node graphs for training node embeddings, GraphScale is currently deployed in production at TikTok enabling efficient learning over such large graphs.

7/23/2024

LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme

Jeongmin Brian Park, Kun Wu, Vikram Sharma Mailthody, Zaid Quresh, Scott Mahlke, Wen-mei Hwu

Graph Neural Networks (GNNs) are widely used today in recommendation systems, fraud detection, and node/link classification tasks. Real world GNNs continue to scale in size and require a large memory footprint for storing graphs and embeddings that often exceed the memory capacities of the target GPUs used for training. To address limited memory capacities, traditional GNN training approaches use graph partitioning and sharding techniques to scale up across multiple GPUs within a node and/or scale out across multiple nodes. However, this approach suffers from the high computational costs of graph partitioning algorithms and inefficient communication across GPUs. To address these overheads, we propose Large-scale Storage-based Multi-GPU GNN framework (LSM-GNN), a storage-based approach to train GNN models that utilizes a novel communication layer enabling GPU software caches to function as a system-wide shared cache with low overheads. LSM-GNN incorporates a hybrid eviction policy that intelligently manages cache space by using both static and dynamic node information to significantly enhance cache performance. Furthermore, we introduce the Preemptive Victim-buffer Prefetcher (PVP), a mechanism for prefetching node feature data from a Victim Buffer located in CPU pinned-memory to further reduce the pressure on the storage devices. Experimental results show that despite the lower compute capabilities and memory capacities, LSM-GNN in a single node with two GPUs offers superior performance over two-node-four-GPU Dist-DGL baseline and provides up to 3.75x speed up on end-to-end epoch time while running large-scale GNN training.

7/23/2024

CDFGNN: a Systematic Design of Cache-based Distributed Full-Batch Graph Neural Network Training with Communication Reduction

Shuai Zhang, Zite Jiang, Haihang You

Graph neural network training is mainly categorized into mini-batch and full-batch training methods. The mini-batch training method samples subgraphs from the original graph in each iteration. This sampling operation introduces extra computation overhead and reduces the training accuracy. Meanwhile, the full-batch training method calculates the features and corresponding gradients of all vertices in each iteration, and therefore has higher convergence accuracy. However, in the distributed cluster, frequent remote accesses of vertex features and gradients lead to huge communication overhead, thus restricting the overall training efficiency. In this paper, we introduce the cache-based distributed full-batch graph neural network training framework (CDFGNN). We propose the adaptive cache mechanism to reduce the remote vertex access by caching the historical features and gradients of neighbor vertices. Besides, we further optimize the communication overhead by quantizing the messages and designing the graph partition algorithm for the hierarchical communication architecture. Experiments show that the adaptive cache mechanism reduces remote vertex accesses by 63.14% on average. Combined with communication quantization and hierarchical GP algorithm, CDFGNN outperforms the state-of-the-art distributed full-batch training frameworks by 30.39% in our experiments. Our results indicate that CDFGNN has great potential in accelerating distributed full-batch GNN training tasks.

8/2/2024