Data-driven Forecasting of Deep Learning Performance on GPUs

Read original: arXiv:2407.13853 - Published 7/22/2024 by Seonho Lee, Amar Phanishayee, Divya Mahajan

Data-driven Forecasting of Deep Learning Performance on GPUs

Overview

This paper presents a data-driven approach to forecasting the performance of deep learning models on GPUs.
The authors develop a machine learning model that can predict the runtime and memory usage of deep learning models on different GPU architectures based on their architectural and input characteristics.
The goal is to enable more efficient GPU resource allocation and utilization for deep learning workloads.

Plain English Explanation

The paper explores a way to [object Object] when run on [object Object]. The researchers created a machine learning model that can look at the structure and input size of a deep learning model and [object Object] and how much memory it will use on different types of GPUs.

This could be useful for [object Object] more efficiently when running deep learning workloads, such as training large neural networks. By predicting performance ahead of time, systems could [object Object] and schedule jobs more effectively to avoid bottlenecks and ensure resources are being used optimally.

Technical Explanation

The paper presents a data-driven approach to [object Object] of deep learning models when executed on different GPU architectures. The authors develop a machine learning model that can predict these performance metrics based on the architectural and input characteristics of the deep learning model.

To train and evaluate their prediction model, the researchers collected a large dataset of deep learning workloads running on various GPU hardware. This dataset included details about the deep learning model structure, the input size, the GPU used, and the resulting runtime and memory usage.

Using this dataset, the authors trained a [object Object] that takes as input the characteristics of the deep learning model and GPU, and outputs predictions for the runtime and memory usage. They experimented with different network architectures and hyperparameter settings to optimize the model's accuracy.

The results show that the proposed approach can [object Object] with reasonably low error rates, outperforming baseline methods. The authors discuss how this capability could enable [object Object] for deep learning workloads.

Critical Analysis

The paper presents a compelling approach to [object Object]. However, the authors acknowledge several limitations and avenues for further research.

One key limitation is that the prediction model was trained and evaluated on a fixed set of deep learning models and GPU architectures. Its generalization to new, unseen models and hardware may be limited. Further research is needed to [object Object].

Additionally, the paper does not address the potential for [object Object] when multiple deep learning jobs are running concurrently. Integrating these factors into the performance prediction model could lead to even more accurate and useful forecasts.

Overall, the research represents an important step towards [object Object] for deep learning workloads. With further development and validation, the proposed approach could have significant practical implications for improving the efficiency and utilization of GPU clusters.

Conclusion

This paper presents a data-driven method for [object Object] when executed on GPUs. By developing a machine learning model that can predict runtime and memory usage based on model and hardware characteristics, the authors aim to enable more efficient GPU resource [object Object].

While the current approach has some limitations, the research represents an important step towards [object Object] for deep learning. With further refinement and validation, this work could have significant practical implications for improving the utilization and performance of GPU-accelerated deep learning systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Data-driven Forecasting of Deep Learning Performance on GPUs

Seonho Lee, Amar Phanishayee, Divya Mahajan

Deep learning kernels exhibit predictable memory accesses and compute patterns, making GPUs' parallel architecture well-suited for their execution. Software and runtime systems for GPUs are optimized to better utilize the stream multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As deep learning models and GPUs evolve, access to newer GPUs is often limited, raising questions about the performance of new model architectures on existing GPUs, existing models on new GPUs, and new model architectures on new GPUs. To address these questions, we introduce NeuSight, a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution. The framework leverages both GPU hardware behavior and software library optimizations to estimate end-to-end performance. Previous work uses regression models that capture linear trends or multilayer perceptrons to predict the overall latency of deep learning kernels on GPUs. These approaches suffer from higher error percentages when forecasting performance on unseen models and new GPUs. Instead, NeuSight decomposes the prediction problem into smaller problems, bounding the prediction through fundamental performance laws. NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU. Tile-granularity predictions are determined using a machine learning approach and aggregated to estimate end-to-end latency. NeuSight outperforms prior work across various deep learning workloads and the latest GPUs. It reduces the percentage error from 198% and 19.7% to 3.8% in predicting the latency of GPT3 model for training and inference on H100, compared to state-of-the-art prior works, where both GPT3 and H100 were not used to train the framework.

7/22/2024

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication collectives, into our upgraded performance modeling pipeline equipped with inter-and intra-rank synchronization for ML workloads trained on multi-GPU platforms. Beyond accurately predicting the per-iteration training time of DLRM models with random configurations with a geomean error of 5.21% on two multi-GPU platforms, our prediction pipeline generalizes well to other types of ML workloads, such as Transformer-based NLP models with a geomean error of 3.00%. Moreover, even without actually running ML workloads like DLRMs on the hardware, it is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration (with a success rate of 85%).

4/30/2024

Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators

Konstantin Lubeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Muller, Federico Nicol'as Peccia, Felix Thommes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann

Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, Plasticine-derived, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several magnitudes faster than an RTL simulation.

9/16/2024

Optimal Kernel Tuning Parameter Prediction using Deep Sequence Models

Khawir Mahmood, Jehandad Khan, Hammad Afzal

GPU kernels have come to the forefront of comput- ing due to their utility in varied fields, from high-performance computing to machine learning. A typical GPU compute kernel is invoked millions, if not billions of times in a typical application, which makes their performance highly critical. Due to the unknown nature of the optimization surface, an exhaustive search is required to discover the global optimum, which is infeasible due to the possible exponential number of parameter combinations. In this work, we propose a methodology that uses deep sequence- to-sequence models to predict the optimal tuning parameters governing compute kernels. This work considers the prediction of kernel parameters as a sequence to the sequence translation problem, borrowing models from the Natural Language Process- ing (NLP) domain. Parameters describing the input, output and weight tensors are considered as the input language to the model that emits the corresponding kernel parameters. In essence, the model translates the problem parameter language to kernel parameter language. The core contributions of this work are: a) Proposing that a sequence to sequence model can accurately learn the performance dynamics of a GPU compute kernel b) A novel network architecture which predicts the kernel tuning parameters for GPU kernels, c) A constrained beam search which incorporates the physical limits of the GPU hardware as well as other expert knowledge reducing the search space. The proposed algorithm can achieve more than 90% accuracy on various convolutional kernels in MIOpen, the AMD machine learning primitives library. As a result, the proposed technique can reduce the development time and compute resources required to tune unseen input configurations, resulting in shorter development cycles, reduced development costs, and better user experience.

4/17/2024