GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism

Read original: arXiv:2303.13775 - Published 6/28/2024 by Sandeep Polisetty, Juelin Liu, Kobi Falus, Yi Ren Fung, Seung-Hwan Lim, Hui Guan, Marco Serafini

🧠

Overview

Graph neural networks (GNNs) are a type of machine learning model that work well on graph-structured data
Training large GNNs requires mini-batch training, which is often done using data parallelism across multiple GPUs
One key performance issue is the loading of input features, which can prevent GPUs from being fully utilized
This paper introduces a new training approach called "split parallelism" to address this issue

Plain English Explanation

Graph neural networks (link) are a powerful type of machine learning model that are great at working with data organized in a graph structure, like social networks or road networks. To train these models on large graphs, researchers often use a technique called mini-batch training, which divides the data into smaller chunks called "mini-batches" and trains the model on them one at a time.

A common way to speed up mini-batch training is to use multiple GPUs in parallel, a technique called "data parallelism." (link) However, the authors of this paper found that a key bottleneck in this process is loading the input features onto the GPUs, which prevents the GPUs from being fully utilized.

To address this issue, the researchers developed a new training approach called "split parallelism." Instead of loading the full input features onto each GPU, split parallelism splits the sampling and training of each mini-batch across the GPUs. This reduces the redundant data loading and allows the GPUs to be used more efficiently.

Technical Explanation

The paper introduces a hybrid parallel mini-batch training paradigm called "split parallelism" to address the performance issues with loading input features in GNN training. (link)

Split parallelism avoids redundant data loads by splitting the sampling and training of each mini-batch across multiple GPUs. A lightweight splitting algorithm is used to divide the work online, at each iteration. This allows the GPUs to be utilized more fully compared to standard data parallelism approaches.

The authors implement split parallelism in a system called "GSplit" and evaluate it against state-of-the-art mini-batch training systems like DGL, Quiver, and $P^3$. (link) (link) The results show that GSplit outperforms these baselines, demonstrating the benefits of the split parallelism approach.

Critical Analysis

The paper provides a novel and promising solution to a key performance issue in GNN training. By avoiding redundant data loading, the split parallelism approach can better utilize the available GPU resources.

However, the paper does not explore the potential impact of the splitting algorithm on model convergence or the overall training stability. There may be some trade-offs to consider, as splitting the mini-batch workload could introduce additional complexities or communication overhead.

Additionally, the experiments are limited to a few specific GNN models and datasets. Further research would be needed to understand how well split parallelism generalizes to a wider range of GNN architectures and real-world applications.

Conclusion

This paper presents a new training approach called "split parallelism" that addresses a key performance bottleneck in graph neural network training. By avoiding redundant data loading, split parallelism can better utilize GPU resources and outperform state-of-the-art mini-batch training systems.

While the paper demonstrates the potential benefits of this approach, further research is needed to understand its broader implications and applicability. Nonetheless, the ideas introduced in this work represent an important step forward in improving the efficiency and scalability of GNN training systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🧠

GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism

Sandeep Polisetty, Juelin Liu, Kobi Falus, Yi Ren Fung, Seung-Hwan Lim, Hui Guan, Marco Serafini

Graph neural networks (GNNs), an emerging class of machine learning models for graphs, have gained popularity for their superior performance in various graph analytical tasks. Mini-batch training is commonly used to train GNNs on large graphs, and data parallelism is the standard approach to scale mini-batch training across multiple GPUs. One of the major performance costs in GNN training is the loading of input features, which prevents GPUs from being fully utilized. In this paper, we argue that this problem is exacerbated by redundancies that are inherent to the data parallel approach. To address this issue, we introduce a hybrid parallel mini-batch training paradigm called split parallelism. Split parallelism avoids redundant data loads and splits the sampling and training of each mini-batch across multiple GPUs online, at each iteration, using a lightweight splitting algorithm. We implement split parallelism in GSplit and show that it outperforms state-of-the-art mini-batch training systems like DGL, Quiver, and $P^3$.

6/28/2024

🧠

An Experimental Comparison of Partitioning Strategies for Distributed Graph Neural Network Training

Nikolai Merkel, Daniel Stoll, Ruben Mayer, Hans-Arno Jacobsen

Recently, graph neural networks (GNNs) have gained much attention as a growing area of deep learning capable of learning on graph-structured data. However, the computational and memory requirements for training GNNs on large-scale graphs make it necessary to distribute the training. A prerequisite for distributed GNN training is to partition the input graph into smaller parts that are distributed among multiple machines of a compute cluster. Although graph partitioning has been studied with regard to graph analytics and graph databases, its effect on GNN training performance is largely unexplored. As a consequence, it is unclear whether investing computational efforts into high-quality graph partitioning would pay off in GNN training scenarios. In this paper, we study the effectiveness of graph partitioning for distributed GNN training. Our study aims to understand how different factors such as GNN parameters, mini-batch size, graph type, features size, and scale-out factor influence the effectiveness of graph partitioning. We conduct experiments with two different GNN systems using vertex and edge partitioning. We found that high-quality graph partitioning is a very effective optimization to speed up GNN training and to reduce memory consumption. Furthermore, our results show that invested partitioning time can quickly be amortized by reduced GNN training time, making it a relevant optimization for most GNN scenarios. Compared to research on distributed graph processing, our study reveals that graph partitioning plays an even more significant role in distributed GNN training, which motivates further research on the graph partitioning problem.

8/13/2024

GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs

Vipul Gupta, Xin Chen, Ruoyun Huang, Fanlong Meng, Jianjun Chen, Yujun Yan

Graph Neural Networks (GNNs) have emerged as powerful tools for supervised machine learning over graph-structured data, while sampling-based node representation learning is widely utilized in unsupervised learning. However, scalability remains a major challenge in both supervised and unsupervised learning for large graphs (e.g., those with over 1 billion nodes). The scalability bottleneck largely stems from the mini-batch sampling phase in GNNs and the random walk sampling phase in unsupervised methods. These processes often require storing features or embeddings in memory. In the context of distributed training, they require frequent, inefficient random access to data stored across different workers. Such repeated inter-worker communication for each mini-batch leads to high communication overhead and computational inefficiency. We propose GraphScale, a unified framework for both supervised and unsupervised learning to store and process large graph data distributedly. The key insight in our design is the separation of workers who store data and those who perform the training. This separation allows us to decouple computing and storage in graph training, thus effectively building a pipeline where data fetching and data computation can overlap asynchronously. Our experiments show that GraphScale outperforms state-of-the-art methods for distributed training of both GNNs and node embeddings. We evaluate GraphScale both on public and proprietary graph datasets and observe a reduction of at least 40% in end-to-end training times compared to popular distributed frameworks, without any loss in performance. While most existing methods don't support billion-node graphs for training node embeddings, GraphScale is currently deployed in production at TikTok enabling efficient learning over such large graphs.

7/23/2024

Slicing Input Features to Accelerate Deep Learning: A Case Study with Graph Neural Networks

Zhengjia Xu, Dingyang Lyu, Jinghui Zhang

As graphs grow larger, full-batch GNN training becomes hard for single GPU memory. Therefore, to enhance the scalability of GNN training, some studies have proposed sampling-based mini-batch training and distributed graph learning. However, these methods still have drawbacks, such as performance degradation and heavy communication. This paper introduces SliceGCN, a feature-sliced distributed large-scale graph learning method. SliceGCN slices the node features, with each computing device, i.e., GPU, handling partial features. After each GPU processes its share, partial representations are obtained and concatenated to form complete representations, enabling a single GPU's memory to handle the entire graph structure. This aims to avoid the accuracy loss typically associated with mini-batch training (due to incomplete graph structures) and to reduce inter-GPU communication during message passing (the forward propagation process of GNNs). To study and mitigate potential accuracy reductions due to slicing features, this paper proposes feature fusion and slice encoding. Experiments were conducted on six node classification datasets, yielding some interesting analytical results. These results indicate that while SliceGCN does not enhance efficiency on smaller datasets, it does improve efficiency on larger datasets. Additionally, we found that SliceGCN and its variants have better convergence, feature fusion and slice encoding can make training more stable, reduce accuracy fluctuations, and this study also discovered that the design of SliceGCN has a potentially parameter-efficient nature.

8/22/2024