Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large-Scale Recommendation

Read original: arXiv:2403.00877 - Published 5/3/2024 by Liang Luo, Buyun Zhang, Michael Tsang, Yinbin Ma, Ching-Hsiang Chu, Yuxin Chen, Shen Li, Yuchen Hao, Yanli Zhao, Guna Lakshminarayanan and 4 others

Overview

This paper discusses the challenges of training large-scale recommendation models, particularly those based on deep learning.
It explores key issues such as model complexity, data diversity, and distributed training, and presents solutions to address these challenges.
The paper proposes novel techniques like Cross-Silo Federated Learning, DIMAT, and Multi-Level Framework to improve the training process.

Plain English Explanation

Recommendation systems are a crucial part of many online services, helping users discover new content, products, or services they might be interested in. These systems often use complex deep learning models to make their recommendations. However, training these large-scale models can be challenging due to factors like the sheer size of the models, the diversity of data they need to learn from, and the distributed nature of the training process.

The paper examines these key challenges and presents solutions to address them. For example, it introduces a technique called Cross-Silo Federated Learning that allows multiple organizations to collaborate on training a model without sharing their private data. Another solution, DIMAT, helps speed up the training process by merging models trained on different subsets of the data in a decentralized way. The Multi-Level Framework further improves training efficiency for large transformer models.

By addressing these challenges, the techniques described in the paper can help make it easier to develop and deploy high-performance recommendation systems that can better serve the needs of users.

Technical Explanation

The paper identifies several key challenges in training large-scale recommendation models:

Model Complexity: Recommendation models, especially those based on deep learning, can become extremely complex, with millions or even billions of parameters. This makes them computationally expensive to train and deploy.
Data Diversity: Recommendation systems need to learn from diverse datasets that capture the preferences and behaviors of a wide range of users. Aggregating and processing this data can be a significant challenge.
Distributed Training: Recommendation models are often trained in a distributed fashion, with different parts of the model or data spread across multiple servers or devices. Coordinating this distributed training process can be complex and error-prone.
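The coordination problem in distributed training can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes a one-parameter linear model, equally sized data shards, and hypothetical helper names (`local_gradient`, `data_parallel_step`). Each simulated worker computes a gradient on its own shard, and the gradients are averaged before a single shared update, which is the role an all-reduce plays in real synchronous data-parallel training.

```python
# Toy synchronous data-parallel SGD. All names here are illustrative
# assumptions, not APIs or methods described in the paper.

def local_gradient(w, shard):
    """Mean-squared-error gradient for the model y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr=0.05):
    """One synchronous step: per-worker gradients are averaged, then applied once."""
    grads = [local_gradient(w, shard) for shard in shards]  # would run in parallel
    avg_grad = sum(grads) / len(grads)  # stands in for an all-reduce across workers
    return w - lr * avg_grad


# Two simulated workers, each holding a shard of data generated by y = 2 * x.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)

print(f"learned weight: {w:.4f}")
```

Because the shards are the same size, averaging the per-worker gradients gives the same update as a single full-batch gradient; with unequal shards the average would need to be weighted by shard size.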

To address these challenges, the paper presents several novel techniques:

Cross-Silo Federated Learning: This approach allows multiple organizations to collaborate on training a shared recommendation model without sharing their private user data. The model is trained locally on each organization's data and then the updates are aggregated in a secure and privacy-preserving way.
DIMAT: This is a decentralized iterative model aggregation technique that can speed up the training of large-scale recommendation models. It merges partially trained models in a distributed fashion, without the need for a central coordinator.
Multi-Level Framework: This framework accelerates the training of large transformer-based recommendation models by introducing a multi-level optimization approach that adapts the training process to the model's architecture.
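The two aggregation styles described above can be sketched in a few lines. The summary does not give the actual algorithms, so this is only an illustration under stated assumptions: models are plain lists of floats, central aggregation is a size-weighted parameter average (in the style of FedAvg), and coordinator-free merging is simple neighbor averaging on a ring. The function names `federated_average` and `ring_gossip_round` are hypothetical, and no secure-aggregation or privacy mechanism is modeled.

```python
# Illustrative aggregation sketches. These are generic techniques swapped in
# for explanation; they are not the specific methods named in the text.

def federated_average(client_weights, client_sizes):
    """Size-weighted average of client parameter vectors (FedAvg-style)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def ring_gossip_round(models):
    """Each node averages its parameters with its two ring neighbors.

    No central coordinator is involved; repeated rounds drive every node
    toward the mean of the initial models.
    """
    n = len(models)
    return [
        [
            (models[(k - 1) % n][i] + models[k][i] + models[(k + 1) % n][i]) / 3
            for i in range(len(models[k]))
        ]
        for k in range(n)
    ]


# Central aggregation: the second client holds three times as much data.
global_model = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])

# Decentralized merging: four nodes with different one-parameter models.
nodes = [[0.0], [3.0], [6.0], [9.0]]
for _ in range(50):
    nodes = ring_gossip_round(nodes)

print(global_model, nodes)
```

In the weighted average, only parameter vectors and dataset sizes leave each client, never the raw data. In the ring variant, each round preserves the overall mean, so all nodes settle on the same merged model.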

The paper also discusses other techniques, such as MTDT for multi-task learning and Universal Performance Modeling for optimizing training hyperparameters.

Critical Analysis

The paper presents a comprehensive overview of the key challenges in training large-scale recommendation models and proposes several innovative solutions to address them. The techniques described, such as Cross-Silo Federated Learning and DIMAT, show promising results in improving the efficiency and scalability of the training process.

However, the paper also acknowledges some limitations and areas for further research. For example, the federated learning approach may still face challenges in scenarios where the data distributions across different organizations are significantly different. Additionally, the performance of the proposed techniques may depend on the specific characteristics of the recommendation task and dataset.

It would be valuable to see further empirical evaluations of these techniques on a wider range of recommendation tasks and datasets, as well as comparisons with other state-of-the-art approaches. Exploring the integration of these techniques with other advancements in machine learning, such as few-shot learning or meta-learning, could also lead to further improvements in the training and deployment of large-scale recommendation models.

Conclusion

This paper presents a comprehensive analysis of the key challenges in training large-scale recommendation models and introduces several novel techniques to address these challenges. The proposed solutions, such as Cross-Silo Federated Learning, DIMAT, and the Multi-Level Framework, demonstrate significant potential in improving the efficiency, scalability, and privacy-preserving capabilities of recommendation model training.

By tackling these fundamental issues, the techniques described in the paper can contribute to the development of more powerful and accessible recommendation systems that can better serve the needs of users across a wide range of applications and industries.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large-Scale Recommendation

Liang Luo, Buyun Zhang, Michael Tsang, Yinbin Ma, Ching-Hsiang Chu, Yuxin Chen, Shen Li, Yuchen Hao, Yanli Zhao, Guna Lakshminarayanan, Ellie Dingqiao Wen, Jongsoo Park, Dheevatsa Mudigere, Maxim Naumov

We study a mismatch between the deep learning recommendation models' flat architecture, common distributed training paradigm and hierarchical data center topology. To address the associated inefficiencies, we propose Disaggregated Multi-Tower (DMT), a modeling technique that consists of (1) Semantic-preserving Tower Transform (SPTT), a novel training paradigm that decomposes the monolithic global embedding lookup process into disjoint towers to exploit data center locality; (2) Tower Module (TM), a synergistic dense component attached to each tower to reduce model complexity and communication volume through hierarchical feature interaction; and (3) Tower Partitioner (TP), a feature partitioner to systematically create towers with meaningful feature interactions and load balanced assignments to preserve model quality and training throughput via learned embeddings. We show that DMT can achieve up to 1.9x speedup compared to the state-of-the-art baselines without losing accuracy across multiple generations of hardware at large data center scales.

5/3/2024

Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management

Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, Bin Cui

Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with the unified base model structure and several specialized model components. However, the development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems. Due to the sophisticated model architecture and the heterogeneous workloads of different ML tasks and data modalities, training these models usually requires massive GPU resources and suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of large-scale MT MM models through data heterogeneity-aware model management optimization. The key idea is to decompose the model execution into stages and address the joint optimization problem sequentially, including both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. Based on this, we build a prototype system and evaluate it on various large MT MM models. Experiments demonstrate the superior performance and efficiency of our system, with speedup ratio up to 71% compared to state-of-the-art training systems.

9/6/2024

DTN: Deep Multiple Task-specific Feature Interactions Network for Multi-Task Recommendation

Yaowen Bi, Yuteng Lian, Jie Cui, Jun Liu, Peijian Wang, Guanghui Li, Xuejun Chen, Jinglin Zhao, Hao Wen, Jing Zhang, Zhaoqi Zhang, Wenzhuo Song, Yang Sun, Weiwei Zhang, Mingchen Cai, Guanxing Zhang

Neural-based multi-task learning (MTL) has been successfully applied to many recommendation applications. However, these MTL models (e.g., MMoE, PLE) did not consider feature interaction during the optimization, which is crucial for capturing complex high-order features and has been widely used in ranking models for real-world recommender systems. Moreover, through feature importance analysis across various tasks in MTL, we have observed an interesting divergence phenomenon that the same feature can have significantly different importance across different tasks in MTL. To address these issues, we propose Deep Multiple Task-specific Feature Interactions Network (DTN) with a novel model structure design. DTN introduces multiple diversified task-specific feature interaction methods and task-sensitive network in MTL networks, enabling the model to learn task-specific diversified feature interaction representations, which improves the efficiency of joint representation learning in a general setup. We applied DTN to our company's real-world E-commerce recommendation dataset, which consisted of over 6.3 billion samples; the results demonstrated that DTN significantly outperformed state-of-the-art MTL models. Moreover, during online evaluation of DTN in a large-scale E-commerce recommender system, we observed a 3.28% increase in clicks, a 3.10% increase in orders and a 2.70% increase in GMV (Gross Merchandise Value) compared to the state-of-the-art MTL models. Finally, extensive offline experiments conducted on public benchmark datasets demonstrate that DTN can be applied to various scenarios beyond recommendations, enhancing the performance of ranking models.

8/26/2024

Hierarchical Learning and Computing over Space-Ground Integrated Networks

Jingyang Zhu, Yuanming Shi, Yong Zhou, Chunxiao Jiang, Linling Kuang

Space-ground integrated networks hold great promise for providing global connectivity, particularly in remote areas where large amounts of valuable data are generated by Internet of Things (IoT) devices but terrestrial communication infrastructure is lacking. The massive data is conventionally transferred to the cloud server for centralized artificial intelligence (AI) model training, raising huge communication overhead and privacy concerns. To address this, we propose a hierarchical learning and computing framework, which leverages the low-latency characteristic of low-earth-orbit (LEO) satellites and the global coverage of geostationary-earth-orbit (GEO) satellites, to provide global aggregation services for locally trained models on ground IoT devices. Due to the time-varying nature of satellite network topology and the energy constraints of LEO satellites, efficiently aggregating the received local models from ground devices on LEO satellites is highly challenging. By leveraging the predictability of inter-satellite connectivity and modeling the space network as a directed graph, we formulate a network energy minimization problem for model aggregation, which turns out to be a Directed Steiner Tree (DST) problem. We propose a topology-aware energy-efficient routing (TAEER) algorithm to solve the DST problem by finding a minimum spanning arborescence on a substitute directed graph. Extensive simulations under real-world space-ground integrated network settings demonstrate that the proposed TAEER algorithm significantly reduces energy consumption and outperforms benchmarks.

8/27/2024