DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud

Read original: arXiv:2304.01468 - Published 7/1/2024 by Qinlong Wang, Tingfeng Lan, Yinghao Tang, Ziling Huang, Yiheng Du, Haitao Zhang, Jian Sha, Hui Lu, Yuanchun Zhou, Ke Zhang and 1 other

Overview

Deep learning recommendation models (DLRMs) rely on large embedding tables to manage categorical sparse features
Expanding these embedding tables can improve model performance, but increases resource usage
Tech companies have built cloud services to accelerate DLRM training at scale
This paper investigates DLRM training challenges at AntGroup and introduces DLRover-RM, an elastic training framework to address them

Plain English Explanation

Deep learning recommendation models (DLRMs) are a type of machine learning model used for personalizing recommendations, such as product suggestions on an e-commerce website. These models rely on embedding tables, which are large data structures that store information about different categories of data, like product types or user demographics.

Increasing the size of these embedding tables can significantly improve the performance of DLRM models, but it also requires more computing resources like GPU, CPU, and memory. To help companies train these large, resource-intensive models, tech giants have built cloud-based services to accelerate the training process.

In this paper, the researchers investigated the challenges of training DLRMs on these cloud platforms at AntGroup, a major tech company. They found two key issues: low resource utilization, because users often choose suboptimal resource configurations for their jobs, and frequent job abnormalities caused by the unstable nature of cloud environments.

To address these problems, the researchers developed DLRover-RM, an elastic training framework designed to increase resource utilization and handle the instability of the cloud. DLRover-RM uses a resource-performance model to automatically allocate and dynamically adjust resources for DLRM training jobs, and it includes several mechanisms to ensure efficient and reliable execution of these jobs.

Technical Explanation

The paper first highlights the importance of large embedding tables in deep learning recommendation models (DLRMs). Expanding these embedding tables can significantly enhance model performance, but it also leads to increased GPU, CPU, and memory usage.
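To make the memory cost concrete, the weights of a dense embedding table take roughly rows × embedding dimension × bytes per value. The sketch below works through that arithmetic. The table sizes and the function names are hypothetical and chosen only for illustration; they are not figures from the paper, and the estimate ignores optimizer state and framework overhead.

```python
def embedding_table_bytes(num_rows: int, embedding_dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store the weights of one dense embedding table.

    bytes_per_value defaults to 4, i.e. 32-bit floats.
    """
    return num_rows * embedding_dim * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical categorical feature with 100 million distinct IDs.
NUM_IDS = 100_000_000

small = embedding_table_bytes(NUM_IDS, embedding_dim=16)
large = embedding_table_bytes(NUM_IDS, embedding_dim=64)

print(f"dim=16: {to_gib(small):.2f} GiB")  # about 5.96 GiB
print(f"dim=64: {to_gib(large):.2f} GiB")  # about 23.84 GiB
```

Quadrupling the embedding dimension quadruples the footprint of this single table, which is why expanding embedding tables quickly drives up memory demand.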

The researchers then discuss how major tech companies have built cloud-based services to accelerate the training of DLRM models at scale. However, their investigation of the DLRM training platforms at AntGroup revealed two critical challenges: low resource utilization due to suboptimal configurations by users, and the tendency to encounter abnormalities due to the unstable nature of cloud environments.

To address these challenges, the researchers introduce DLRover-RM, an elastic training framework for DLRMs. DLRover-RM builds a resource-performance model that accounts for the unique characteristics of DLRMs, and applies a three-stage heuristic strategy to automatically allocate and dynamically adjust resources for DLRM training jobs, which raises resource utilization.
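The general idea of a resource-performance model can be illustrated with a small sketch: fit a curve to observed (resources, throughput) samples, then use the fitted curve to pick an allocation. The example below is not the model or the three-stage strategy from the paper. It assumes a simple saturating form, 1/throughput = a/workers + b, and every name, sample, and threshold in it is hypothetical.

```python
from typing import List, Tuple


def fit_saturating_model(samples: List[Tuple[int, float]]) -> Tuple[float, float]:
    """Fit 1/throughput = a * (1/workers) + b by ordinary least squares.

    This functional form is an assumption made for illustration only.
    """
    xs = [1.0 / workers for workers, _ in samples]
    ys = [1.0 / throughput for _, throughput in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict_throughput(workers: int, a: float, b: float) -> float:
    """Predicted throughput (e.g. steps per second) for a worker count."""
    return 1.0 / (a / workers + b)


def smallest_sufficient_allocation(
    a: float, b: float, max_workers: int, target_fraction: float = 0.8
) -> int:
    """Smallest worker count reaching target_fraction of the throughput
    predicted at max_workers."""
    ceiling = predict_throughput(max_workers, a, b)
    for workers in range(1, max_workers + 1):
        if predict_throughput(workers, a, b) >= target_fraction * ceiling:
            return workers
    return max_workers


# Synthetic profiling samples generated from a = 1.0, b = 0.05 (not real data).
samples = [(w, 1.0 / (1.0 / w + 0.05)) for w in (1, 2, 4, 8)]

a, b = fit_saturating_model(samples)
chosen = smallest_sufficient_allocation(a, b, max_workers=32)
print(f"a={a:.3f}, b={b:.3f}, chosen workers={chosen}")
```

Under these assumptions the fitted curve flattens as workers are added, so the search stops well short of the maximum: extra workers beyond that point would mostly sit idle, which is the kind of waste automatic allocation is meant to avoid.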

Furthermore, DLRover-RM implements multiple mechanisms to ensure efficient and reliable execution of DLRM training jobs, such as dynamic resource adjustment and job failure handling.
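One common way to make a training job tolerate failures is to checkpoint periodically and resume from the last checkpoint after a crash. The sketch below simulates that pattern in plain Python. It is a generic illustration, not DLRover-RM's actual mechanism, and the function name, parameters, and failure injection are all hypothetical.

```python
from typing import Dict, Set


def run_with_recovery(
    total_steps: int,
    checkpoint_every: int,
    failing_steps: Set[int],
    max_restarts: int = 5,
) -> Dict[str, int]:
    """Simulate a training loop that resumes from its last checkpoint.

    failing_steps lists the steps at which a one-time simulated failure
    is injected before the step runs.
    """
    checkpoint = 0          # last step count that was saved
    restarts = 0
    executed_steps = 0      # total work done, including redone steps
    pending_failures = set(failing_steps)

    while True:
        step = checkpoint   # resume from the last saved position
        try:
            while step < total_steps:
                if step in pending_failures:
                    pending_failures.discard(step)  # each failure fires once
                    raise RuntimeError(f"simulated failure at step {step}")
                step += 1
                executed_steps += 1
                if step % checkpoint_every == 0:
                    checkpoint = step
            return {
                "completed_steps": step,
                "restarts": restarts,
                "executed_steps": executed_steps,
            }
        except RuntimeError:
            restarts += 1
            if restarts > max_restarts:
                raise


result = run_with_recovery(total_steps=10, checkpoint_every=4, failing_steps={6})
print(result)
```

In this run the failure at step 6 forces a restart from the checkpoint at step 4, so two steps are repeated. Checkpointing more often reduces the repeated work at the cost of more frequent saves.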

The paper presents extensive evaluations showing that DLRover-RM reduces job completion times by 31%, increases the job completion rate by 6%, enhances CPU usage by 15%, and improves memory utilization by 20%, compared to state-of-the-art resource scheduling frameworks. DLRover-RM has been widely deployed at AntGroup and processes thousands of DLRM training jobs on a daily basis. It has also been open-sourced and adopted by 10+ companies.

Critical Analysis

The paper provides a thorough investigation of the challenges faced in training DLRM models at scale on cloud platforms, and presents a comprehensive solution in the form of the DLRover-RM framework. The researchers have clearly identified the key issues of low resource utilization and the instability of cloud environments, and have designed DLRover-RM to address these problems effectively.

One potential area for further research could be the exploration of transfer learning or meta-learning techniques to improve the performance of the resource-performance model, which is a critical component of DLRover-RM. Additionally, the researchers could investigate the applicability of their framework to other types of machine learning models beyond DLRMs.

Overall, the paper presents a significant contribution to the field of large-scale machine learning model training, and the DLRover-RM framework appears to be a practical and effective solution for companies facing the challenges of training resource-intensive DLRM models in the cloud.

Conclusion

This paper tackles the critical challenges of training deep learning recommendation models (DLRMs) at scale on cloud platforms. The researchers identified two key issues: low resource utilization due to suboptimal configurations, and the tendency to encounter abnormalities in unstable cloud environments.

To address these problems, the researchers developed DLRover-RM, an elastic training framework for DLRMs. DLRover-RM uses a resource-performance model and a heuristic strategy to automatically allocate and dynamically adjust resources for DLRM training jobs, increasing utilization. It also includes mechanisms to ensure efficient and reliable execution of these jobs.

The extensive evaluations show that DLRover-RM significantly outperforms state-of-the-art resource scheduling frameworks, reducing job completion times, increasing job completion rates, and improving CPU and memory utilization. DLRover-RM has been widely deployed at AntGroup and adopted by over 10 companies, demonstrating its practical value in the industry.

This research represents an important advancement in the field of large-scale machine learning model training, and the DLRover-RM framework can potentially be adapted to benefit a wide range of companies and applications relying on resource-intensive models like DLRMs.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud

Qinlong Wang, Tingfeng Lan, Yinghao Tang, Ziling Huang, Yiheng Du, Haitao Zhang, Jian Sha, Hui Lu, Yuanchun Zhou, Ke Zhang, Mingjie Tang

Deep learning recommendation models (DLRM) rely on large embedding tables to manage categorical sparse features. Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage. Meanwhile, tech companies have built extensive cloud-based services to accelerate training DLRM models at scale. In this paper, we conduct a deep investigation of the DLRM training platforms at AntGroup and reveal two critical challenges: low resource utilization due to suboptimal configurations by users and the tendency to encounter abnormalities due to an unstable cloud environment. To overcome them, we introduce DLRover-RM, an elastic training framework for DLRMs designed to increase resource utilization and handle the instability of a cloud environment. DLRover-RM develops a resource-performance model by considering the unique characteristics of DLRMs and a three-stage heuristic strategy to automatically allocate and dynamically adjust resources for DLRM training jobs for higher resource utilization. Further, DLRover-RM develops multiple mechanisms to ensure efficient and reliable execution of DLRM training jobs. Our extensive evaluation shows that DLRover-RM reduces job completion times by 31%, increases the job completion rate by 6%, enhances CPU usage by 15%, and improves memory utilization by 20%, compared to state-of-the-art resource scheduling frameworks. DLRover-RM has been widely deployed at AntGroup and processes thousands of DLRM training jobs on a daily basis. DLRover-RM is open-sourced and has been adopted by 10+ companies.

7/1/2024

UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture

Sitian Chen, Haobin Tan, Amelie Chi Zhou, Yusen Li, Pavan Balaji

Deep Learning Recommendation Models (DLRMs) have gained popularity in recommendation systems due to their effectiveness in handling large-scale recommendation tasks. The embedding layers of DLRMs have become the performance bottleneck due to their intensive needs on memory capacity and memory bandwidth. In this paper, we propose UpDLRM, which utilizes real-world processing-in-memory (PIM) hardware, UPMEM DPU, to boost the memory bandwidth and reduce recommendation latency. The parallel nature of the DPU memory can provide high aggregated bandwidth for the large number of irregular memory accesses in embedding lookups, thus offering great potential to reduce the inference latency. To fully utilize the DPU memory bandwidth, we further studied the embedding table partitioning problem to achieve good workload-balance and efficient data caching. Evaluations using real-world datasets show that UpDLRM achieves much lower inference time for DLRM compared to both CPU-only and CPU-GPU hybrid counterparts.

6/21/2024

Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression

Hao Feng, Boyuan Zhang, Fanjiang Ye, Min Si, Ching-Hsiang Chu, Jiannan Tian, Chunxing Yin, Summer Deng, Yuchen Hao, Pavan Balaji, Tong Geng, Dingwen Tao

DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications. The large size of DLRM models, however, necessitates the use of multiple devices/GPUs for efficient training. A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices. To mitigate this, we introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training. We develop a novel error-bounded lossy compression algorithm, informed by an in-depth analysis of embedding data features, to achieve high compression ratios. Moreover, we introduce a dual-level adaptive strategy for error-bound adjustment, spanning both table-wise and iteration-wise aspects, to balance the compression benefits with the potential impacts on accuracy. We further optimize our compressor for PyTorch tensors on GPUs, minimizing compression overhead. Evaluation shows that our method achieves a 1.38$\times$ training speedup with a minimal accuracy impact.

8/27/2024

Task-level Distributionally Robust Optimization for Large Language Model-based Dense Retrieval

Guangyuan Ma, Yongliang Ma, Xing Wu, Zhenpeng Su, Ming Zhou, Songlin Hu

Large Language Model-based Dense Retrieval (LLM-DR) optimizes over numerous heterogeneous fine-tuning collections from different domains. However, the discussion about its training data distribution is still minimal. Previous studies rely on empirically assigned dataset choices or sampling ratios, which inevitably leads to sub-optimal retrieval performances. In this paper, we propose a new task-level Distributionally Robust Optimization (tDRO) algorithm for LLM-DR fine-tuning, targeted at improving the universal domain generalization ability by end-to-end reweighting the data distribution of each task. The tDRO parameterizes the domain weights and updates them with scaled domain gradients. The optimized weights are then transferred to the LLM-DR fine-tuning to train more robust retrievers. Experiments show optimal improvements in large-scale retrieval benchmarks and reduce up to 30% dataset usage after applying our optimization algorithm with a series of different-sized LLM-DR models.

8/21/2024