LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme

Read original: arXiv:2407.15264 - Published 7/23/2024 by Jeongmin Brian Park, Kun Wu, Vikram Sharma Mailthody, Zaid Quresh, Scott Mahlke, Wen-mei Hwu

Overview

Graph Neural Networks (GNNs) are widely used in recommendation systems, fraud detection, and node/link classification tasks.
Real-world GNNs often require large memory footprints that exceed the capacities of target GPUs used for training.
Traditional GNN training approaches use graph partitioning and sharding techniques, but these suffer from high computational costs and inefficient communication across GPUs.

Plain English Explanation

The paper introduces a new framework called Large-scale Storage-based Multi-GPU GNN (LSM-GNN) that addresses the limitations of current GNN training approaches. GNNs are a type of machine learning model that can work with graph-structured data, which is commonly used in various applications like recommending products to customers or detecting fraudulent activities.

As GNNs become more complex and handle larger graphs, they require a lot of memory to store the graph data and the learned embeddings (numerical representations) of the nodes and edges. This often exceeds the memory capacity of the GPUs used for training the models.
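
To make the scale concrete, here is a back-of-the-envelope estimate of the node feature table alone. The node count, feature width, and 80 GB GPU capacity are illustrative assumptions, not figures from the paper, and `feature_table_bytes` is a hypothetical helper.

```python
def feature_table_bytes(num_nodes: int, feature_dim: int, bytes_per_value: int = 4) -> int:
    """Size in bytes of a dense node-feature table (float32 by default)."""
    return num_nodes * feature_dim * bytes_per_value

# Hypothetical graph: 100 million nodes, 1024 float32 features per node.
size_bytes = feature_table_bytes(100_000_000, 1024)
size_gb = size_bytes / 1e9        # decimal gigabytes
gpu_capacity_gb = 80              # assumed memory capacity of one GPU

print(f"feature table: {size_gb:.1f} GB")                                    # 409.6 GB
print(f"GPU memories needed for features alone: {size_gb / gpu_capacity_gb:.1f}")  # 5.1
```

Under these assumptions the features alone would fill several GPUs before the graph structure, model, or activations are counted.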

Traditionally, researchers have tried to solve this problem by splitting the graph data across multiple GPUs, either within a single machine or across multiple machines. However, this approach has two main downsides:

High computational cost: The graph partitioning algorithms used to split the data are computationally expensive.
Inefficient communication: Transferring data between the GPUs is not very efficient, which slows down the training process.

The LSM-GNN framework takes a different approach. Instead of partitioning the graph across GPUs, it keeps the data on storage devices and uses each GPU's memory as a software-managed cache for the parts of the data it needs most, with a buffer in CPU pinned memory acting as an intermediate tier. This allows the framework to work with much larger graphs than would fit in the GPU's memory alone.

The key innovations in LSM-GNN are:

Novel communication layer: This allows the GPU's software caches to function as a system-wide shared cache with low overhead.
Hybrid eviction policy: This intelligently manages the cache space by using both static and dynamic information about the graph nodes.
Preemptive Victim-buffer Prefetcher (PVP): This mechanism prefetches node feature data from a Victim Buffer in the CPU's pinned memory into the GPU's cache, further reducing the pressure on the storage devices.
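
As one way to picture the hybrid eviction policy listed above, the sketch below scores each cached node with a weighted mix of a static signal and a dynamic signal and evicts the lowest-scoring entry. The summary does not say which signals the paper uses, so the choices here are assumptions: the static signal is a precomputed per-node score (for example a normalized degree) and the dynamic signal is recency of access. `HybridCache` and `alpha` are hypothetical names.

```python
class HybridCache:
    """Toy fixed-capacity cache whose eviction score mixes static and dynamic signals.

    Assumptions (not from the paper): the static signal is a per-node score known
    before training, and the dynamic signal is how recently the node was accessed.
    """

    def __init__(self, capacity, static_score, alpha=0.5):
        self.capacity = capacity
        self.static_score = static_score   # node_id -> score in [0, 1]
        self.alpha = alpha                 # weight given to the static signal
        self.data = {}                     # node_id -> feature
        self.last_used = {}                # node_id -> logical time of last access
        self.clock = 0

    def _score(self, node_id):
        # Recency lies in (0, 1]; it shrinks as the entry goes unused.
        recency = 1.0 / (1 + self.clock - self.last_used[node_id])
        static = self.static_score.get(node_id, 0.0)
        return self.alpha * static + (1 - self.alpha) * recency

    def get(self, node_id):
        self.clock += 1
        if node_id in self.data:
            self.last_used[node_id] = self.clock
            return self.data[node_id]
        return None                        # miss: caller fetches from a slower tier

    def put(self, node_id, feature):
        """Insert an entry; return the evicted (node_id, feature) pair, or None."""
        self.clock += 1
        evicted = None
        if node_id not in self.data and len(self.data) >= self.capacity:
            victim = min(self.data, key=self._score)   # lowest score is evicted
            evicted = (victim, self.data.pop(victim))
            del self.last_used[victim]
        self.data[node_id] = feature
        self.last_used[node_id] = self.clock
        return evicted


cache = HybridCache(capacity=2, static_score={"hub": 1.0, "leaf_a": 0.1, "leaf_b": 0.1})
cache.put("hub", [1.0])
cache.put("leaf_a", [2.0])
evicted = cache.put("leaf_b", [3.0])
print(evicted)   # ('leaf_a', [2.0])
```

A pure least-recently-used policy would have evicted `hub` here because it was inserted first; the static score keeps the high-value node resident even though it is the oldest entry.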

Technical Explanation

The LSM-GNN framework uses a storage-based approach to train GNN models, which enables it to handle larger graphs than traditional approaches.

The core components of LSM-GNN are:

Communication Layer: This novel component allows the GPUs' software caches to function as a single system-wide shared cache with low overhead, so that data cached on one GPU can be used by the others.
Hybrid Eviction Policy: LSM-GNN incorporates a hybrid eviction policy that intelligently manages the cache space by using both static and dynamic information about the graph nodes. This significantly enhances the cache performance.
Preemptive Victim-buffer Prefetcher (PVP): This mechanism prefetches node feature data from the CPU's pinned-memory "Victim Buffer" to the GPU's cache, reducing the pressure on the storage devices.
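
The way these tiers could fit together can be sketched as a three-level lookup: GPU cache first, then the CPU-side victim buffer, then storage. This is a simplified illustration, not the paper's implementation. It assumes FIFO eviction in both cache tiers, uses a plain dictionary as a stand-in for the storage-resident feature table, and models the victim buffer only as an on-demand lookup tier rather than modeling the preemptive prefetching itself. `TieredFeatureStore` is a hypothetical name.

```python
from collections import OrderedDict


class TieredFeatureStore:
    """Toy three-tier lookup: GPU cache -> CPU victim buffer -> storage.

    Assumptions (not from the paper): both cache tiers evict in FIFO order, and
    `storage` is any mapping standing in for a storage-resident feature table.
    """

    def __init__(self, storage, gpu_capacity, victim_capacity):
        self.storage = storage
        self.gpu_cache = OrderedDict()
        self.victim_buffer = OrderedDict()
        self.gpu_capacity = gpu_capacity
        self.victim_capacity = victim_capacity
        self.storage_reads = 0             # counts accesses that reached storage

    def _insert_gpu(self, node_id, feature):
        if len(self.gpu_cache) >= self.gpu_capacity:
            # An entry evicted from the GPU cache is kept in the victim buffer.
            old_id, old_feature = self.gpu_cache.popitem(last=False)
            self.victim_buffer[old_id] = old_feature
            if len(self.victim_buffer) > self.victim_capacity:
                self.victim_buffer.popitem(last=False)
        self.gpu_cache[node_id] = feature

    def fetch(self, node_id):
        if node_id in self.gpu_cache:              # tier 1: GPU cache hit
            return self.gpu_cache[node_id]
        if node_id in self.victim_buffer:          # tier 2: CPU victim buffer hit
            feature = self.victim_buffer.pop(node_id)
        else:                                      # tier 3: read from storage
            self.storage_reads += 1
            feature = self.storage[node_id]
        self._insert_gpu(node_id, feature)
        return feature


store = TieredFeatureStore(
    storage={n: [float(n)] for n in range(10)},
    gpu_capacity=2,
    victim_capacity=4,
)
for node_id in [0, 1, 2, 0, 3, 1]:
    store.fetch(node_id)
print(store.storage_reads)   # 4
```

In this toy trace, two of the six accesses miss the GPU cache but are served from the victim buffer, so only four reads reach storage instead of six.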

The researchers conducted experiments to compare the performance of LSM-GNN against a distributed GNN training baseline (Dist-DGL). Despite having lower compute capabilities and memory capacities, a single node with two GPUs running LSM-GNN outperformed the two-node-four-GPU Dist-DGL setup, providing up to a 3.75x speedup on end-to-end epoch time for large-scale GNN training.
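
To read the headline number: speedup is the baseline epoch time divided by the LSM-GNN epoch time, and "up to" means 3.75x is the best case reported. The 600-second baseline below is a made-up placeholder for illustration; only the 3.75x factor comes from the paper.

```python
def epoch_time_after_speedup(baseline_seconds: float, speedup: float) -> float:
    """Epoch time implied by a given speedup over a baseline."""
    return baseline_seconds / speedup

baseline_seconds = 600.0                      # hypothetical Dist-DGL epoch time
best_case = epoch_time_after_speedup(baseline_seconds, 3.75)
reduction = 1 - 1 / 3.75                      # fraction of epoch time saved

print(best_case)                              # 160.0
print(f"{reduction:.1%} less time per epoch") # 73.3% less time per epoch
```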

Critical Analysis

The paper presents a novel and promising approach to addressing the memory limitations of GNN training. The authors have identified a real-world problem that is likely to become more prevalent as GNNs continue to scale in size and complexity.

One potential limitation of the LSM-GNN framework is that it relies on the availability of a CPU-based "Victim Buffer" in pinned memory. This may not be a feasible solution in all hardware configurations, and the authors do not explore alternative approaches for systems without this feature.

Additionally, the paper does not provide a detailed analysis of the trade-offs between the computational overhead of the hybrid eviction policy and the potential benefits in cache performance. It would be helpful to understand the scenarios where this policy provides the most significant improvements.

Further research could also explore the scalability of the LSM-GNN approach as the size of the graph and the number of GPUs used for training increase. It would be interesting to see how the performance and efficiency of the framework hold up under these conditions.

Conclusion

The LSM-GNN framework presented in this paper trains large-scale GNN models by keeping graph data on storage and using the GPUs' software caches as a shared cache. Its key innovations, namely the communication layer, the hybrid eviction policy, and the Preemptive Victim-buffer Prefetcher, enable a single node with two GPUs to outperform the two-node, four-GPU Dist-DGL baseline in the reported experiments.

This work has the potential to significantly impact the field of graph machine learning, as it addresses a fundamental challenge in scaling GNN models to handle larger and more complex graphs. By providing a more efficient and cost-effective training solution, LSM-GNN could enable the widespread adoption of GNNs in a variety of real-world applications, from recommendation systems to fraud detection.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme

Jeongmin Brian Park, Kun Wu, Vikram Sharma Mailthody, Zaid Quresh, Scott Mahlke, Wen-mei Hwu

Graph Neural Networks (GNNs) are widely used today in recommendation systems, fraud detection, and node/link classification tasks. Real-world GNNs continue to scale in size and require a large memory footprint for storing graphs and embeddings that often exceed the memory capacities of the target GPUs used for training. To address limited memory capacities, traditional GNN training approaches use graph partitioning and sharding techniques to scale up across multiple GPUs within a node and/or scale out across multiple nodes. However, this approach suffers from the high computational costs of graph partitioning algorithms and inefficient communication across GPUs. To address these overheads, we propose Large-scale Storage-based Multi-GPU GNN framework (LSM-GNN), a storage-based approach to train GNN models that utilizes a novel communication layer enabling GPU software caches to function as a system-wide shared cache with low overheads. LSM-GNN incorporates a hybrid eviction policy that intelligently manages cache space by using both static and dynamic node information to significantly enhance cache performance. Furthermore, we introduce the Preemptive Victim-buffer Prefetcher (PVP), a mechanism for prefetching node feature data from a Victim Buffer located in CPU pinned-memory to further reduce the pressure on the storage devices. Experimental results show that despite the lower compute capabilities and memory capacities, LSM-GNN in a single node with two GPUs offers superior performance over two-node-four-GPU Dist-DGL baseline and provides up to 3.75x speedup on end-to-end epoch time while running large-scale GNN training.

7/23/2024

GraNNDis: Efficient Unified Distributed Training Framework for Deep GNNs on Large Clusters

Jaeyong Song, Hongsun Jang, Jaewon Jung, Youngsok Kim, Jinho Lee

Graph neural networks (GNNs) are one of the rapidly growing fields within deep learning. While many distributed GNN training frameworks have been proposed to increase the training throughput, they face three limitations when applied to multi-server clusters. 1) They suffer from an inter-server communication bottleneck because they do not consider the inter-/intra-server bandwidth gap, a representative characteristic of multi-server clusters. 2) Redundant memory usage and computation hinder the scalability of the distributed frameworks. 3) Sampling methods, de facto standard in mini-batch training, incur unnecessary errors in multi-server clusters. We found that these limitations can be addressed by exploiting the characteristics of multi-server clusters. Here, we propose GraNNDis, a fast distributed GNN training framework for multi-server clusters. Firstly, we present Flexible Preloading, which preloads the essential vertex dependencies server-wise to reduce the low-bandwidth inter-server communications. Secondly, we introduce Cooperative Batching, which enables memory-efficient, less redundant mini-batch training by utilizing high-bandwidth intra-server communications. Thirdly, we propose Expansion-aware Sampling, a cluster-aware sampling method, which samples the edges that affect the system speedup. As sampling the intra-server dependencies does not contribute much to the speedup as they are communicated through fast intra-server links, it only targets a server boundary to be sampled. Lastly, we introduce One-Hop Graph Masking, a computation and communication structure to realize the above methods in multi-server environments. We evaluated GraNNDis on multi-server clusters, and it provided significant speedup over the state-of-the-art distributed GNN training frameworks. GraNNDis is open-sourced at https://github.com/AIS-SNU/GraNNDis_Artifact to facilitate its use.

8/14/2024

GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs

Vipul Gupta, Xin Chen, Ruoyun Huang, Fanlong Meng, Jianjun Chen, Yujun Yan

Graph Neural Networks (GNNs) have emerged as powerful tools for supervised machine learning over graph-structured data, while sampling-based node representation learning is widely utilized in unsupervised learning. However, scalability remains a major challenge in both supervised and unsupervised learning for large graphs (e.g., those with over 1 billion nodes). The scalability bottleneck largely stems from the mini-batch sampling phase in GNNs and the random walk sampling phase in unsupervised methods. These processes often require storing features or embeddings in memory. In the context of distributed training, they require frequent, inefficient random access to data stored across different workers. Such repeated inter-worker communication for each mini-batch leads to high communication overhead and computational inefficiency. We propose GraphScale, a unified framework for both supervised and unsupervised learning to store and process large graph data distributedly. The key insight in our design is the separation of workers who store data and those who perform the training. This separation allows us to decouple computing and storage in graph training, thus effectively building a pipeline where data fetching and data computation can overlap asynchronously. Our experiments show that GraphScale outperforms state-of-the-art methods for distributed training of both GNNs and node embeddings. We evaluate GraphScale both on public and proprietary graph datasets and observe a reduction of at least 40% in end-to-end training times compared to popular distributed frameworks, without any loss in performance. While most existing methods don't support billion-node graphs for training node embeddings, GraphScale is currently deployed in production at TikTok enabling efficient learning over such large graphs.

7/23/2024

All Against Some: Efficient Integration of Large Language Models for Message Passing in Graph Neural Networks

Ajay Jaiswal, Nurendra Choudhary, Ravinarayana Adkathimar, Muthu P. Alagappan, Gaurush Hiranandani, Ying Ding, Zhangyang Wang, Edward W Huang, Karthik Subbian

Graph Neural Networks (GNNs) have attracted immense attention in the past decade due to their numerous real-world applications built around graph-structured data. On the other hand, Large Language Models (LLMs) with extensive pretrained knowledge and powerful semantic comprehension abilities have recently shown a remarkable ability to benefit applications using vision and text data. In this paper, we investigate how LLMs can be leveraged in a computationally efficient fashion to benefit rich graph-structured data, a modality relatively unexplored in LLM literature. Prior works in this area exploit LLMs to augment every node features in an ad-hoc fashion (not scalable for large graphs), use natural language to describe the complex structural information of graphs, or perform computationally expensive finetuning of LLMs in conjunction with GNNs. We propose E-LLaGNN (Efficient LLMs augmented GNNs), a framework with an on-demand LLM service that enriches message passing procedure of graph learning by enhancing a limited fraction of nodes from the graph. More specifically, E-LLaGNN relies on sampling high-quality neighborhoods using LLMs, followed by on-demand neighborhood feature enhancement using diverse prompts from our prompt catalog, and finally information aggregation using message passing from conventional GNN architectures. We explore several heuristics-based active node selection strategies to limit the computational and memory footprint of LLMs when handling millions of nodes. Through extensive experiments & ablation on popular graph benchmarks of varying scales (Cora, PubMed, ArXiv, & Products), we illustrate the effectiveness of our E-LLaGNN framework and reveal many interesting capabilities such as improved gradient flow in deep GNNs, LLM-free inference ability etc.

7/23/2024