Breaking the Memory Wall for Heterogeneous Federated Learning with Progressive Training

2404.13349

Published 4/23/2024 by Yebo Wu, Li Li, Chunlin Tian, Chengzhong Xu

Abstract

This paper presents ProFL, a novel progressive FL framework to effectively break the memory wall. Specifically, ProFL divides the model into different blocks based on its original architecture. Instead of updating the full model in each training round, ProFL first trains the front blocks and safely freezes them after convergence. Training of the next block is then triggered. This process iterates until the training of the whole model is completed. In this way, the memory footprint is effectively reduced for feasible deployment on heterogeneous devices. In order to preserve the feature representation of each block, we decouple the whole training process into two stages: progressive model shrinking and progressive model growing. During the progressive model shrinking stage, we meticulously design corresponding output modules to assist each block in learning the expected feature representation and obtain the initialization parameters. Then, the obtained output modules are utilized in the corresponding progressive model growing stage. Additionally, to control the training pace for each block, a novel metric from the scalar perspective is proposed to assess the learning status of each block and determine when to trigger the training of the next one. Finally, we theoretically prove the convergence of ProFL and conduct extensive experiments on representative models and datasets to evaluate the effectiveness of ProFL. The results demonstrate that ProFL effectively reduces the peak memory footprint by up to 57.4% and improves model accuracy by up to 82.4%.

Overview

Proposes a new federated learning framework called ProFL to address the memory constraints of on-device training
Introduces progressive training, which trains the model block by block and freezes each block once it converges, so that devices with limited memory can take part
Reports that ProFL reduces the peak memory footprint by up to 57.4% and improves model accuracy by up to 82.4%

Plain English Explanation

Breaking the Memory Wall for Heterogeneous Federated Learning with Progressive Training is a research paper that introduces ProFL, a federated learning framework designed to work around the memory limits of on-device training.

Traditional federated learning approaches often struggle on devices with limited memory, because every device has to store and update the entire model. ProFL tackles this with progressive training. Instead of updating the full model in each round, ProFL splits the model into blocks and trains them one after another: the front blocks are trained first, frozen once they converge, and only then does training of the next block begin.

The key idea behind ProFL is that a device only updates one part of the model at a time. Because the blocks that have already converged are frozen, the memory needed for training stays well below what updating the whole model would require. This step-by-step approach lets devices keep contributing to training without hitting the memory wall.

By breaking the memory constraint, ProFL aims to improve the overall performance and accuracy of federated learning, especially in scenarios where the participating devices have diverse hardware capabilities. This approach can be particularly beneficial for personalized federated learning and other applications where customized models are required on resource-constrained devices.

Technical Explanation

The paper proposes ProFL, a progressive federated learning framework that addresses the memory constraints of on-device training.

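The key innovation of ProFL is progressive training. ProFL divides the model into blocks based on its original architecture and, instead of updating the full model in every round, trains the front blocks first and freezes them once they converge. Only then is training of the next block triggered, which reduces the memory footprint on each device. To preserve the feature representation of each block, training is split into two stages, progressive model shrinking and progressive model growing: the shrinking stage designs an output module for each block and obtains initialization parameters, and the growing stage reuses those output modules. A scalar-based metric assesses the learning status of each block and decides when to start training the next one.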
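Within each round, the server still has to combine what the clients learned. The sketch below shows one way this could look, assuming FedAvg-style weighted averaging restricted to the block currently being trained. The summary does not spell out ProFL's aggregation rule, so the function name `aggregate_active_block`, the name-prefix convention for blocks, and the flat-list parameter format are all hypothetical.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]  # parameter name -> flat list of weights


def aggregate_active_block(
    client_updates: List[Tuple[Params, int]],
    active_prefix: str,
) -> Params:
    """Weighted-average only the parameters whose name starts with active_prefix.

    client_updates holds (parameters, num_local_samples) pairs. Parameters that
    belong to other blocks are skipped (assumption: frozen blocks are not
    re-aggregated because they no longer change).
    """
    total = sum(n for _, n in client_updates)
    if total <= 0:
        raise ValueError("need at least one training sample across clients")

    averaged: Params = {}
    for params, n in client_updates:
        weight = n / total  # FedAvg weight: share of the total samples
        for name, values in params.items():
            if not name.startswith(active_prefix):
                continue  # not part of the active block
            acc = averaged.setdefault(name, [0.0] * len(values))
            for i, v in enumerate(values):
                acc[i] += weight * v
    return averaged


if __name__ == "__main__":
    updates = [
        ({"block1.w": [1.0, 2.0], "block2.w": [0.0, 4.0]}, 10),
        ({"block1.w": [3.0, 2.0], "block2.w": [2.0, 8.0]}, 30),
    ]
    print(aggregate_active_block(updates, "block2."))
```

Restricting aggregation to the active block is a design choice made here for illustration; a real system might also aggregate the output module attached to that block.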

The progressive training process works as follows:

The model is divided into blocks based on its original architecture.
The front blocks are trained first and frozen once they converge.
Training of the next block is then triggered, and the process repeats until the whole model has been trained.
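Following the abstract's description (train a block, freeze it once it converges, then trigger the next one), the schedule can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: `has_plateaued` is a simple stand-in plateau test rather than the scalar-based metric proposed in the paper, and `train_round` is a hypothetical callback that runs one federated round on a given block and returns a score.

```python
from typing import Callable, List


def has_plateaued(history: List[float], window: int = 3, tol: float = 1e-2) -> bool:
    """Stand-in convergence test: the score moved by less than tol over the
    last `window` rounds. This is NOT the metric proposed in the paper."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return max(recent) - min(recent) < tol


def progressive_schedule(
    num_blocks: int,
    train_round: Callable[[int, int], float],
    max_rounds: int,
) -> List[int]:
    """Return, for each round that was run, the index of the block trained.

    train_round(block, round_idx) is expected to run one federated round on
    `block` with all earlier blocks frozen and return a scalar score.
    """
    trained: List[int] = []
    block = 0
    history: List[float] = []
    for round_idx in range(max_rounds):
        history.append(train_round(block, round_idx))
        trained.append(block)
        if has_plateaued(history):
            if block == num_blocks - 1:
                break        # last block converged: training is complete
            block += 1       # freeze the current block, start the next one
            history = []     # the new block gets a fresh history
    return trained


if __name__ == "__main__":
    rounds_on_block = {}

    def fake_round(block: int, round_idx: int) -> float:
        # Toy score that approaches 1.0 geometrically for each block.
        rounds_on_block[block] = rounds_on_block.get(block, 0) + 1
        return 1.0 - 0.5 ** rounds_on_block[block]

    print(progressive_schedule(3, fake_round, max_rounds=100))
```

With the toy score used in the demo, each block is trained for ten rounds before the plateau test fires, so the schedule moves through blocks 0, 1, and 2 in order and stops after thirty rounds.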

This approach enables effective learning on devices with diverse hardware capabilities, including those with limited memory. By breaking the memory constraint, ProFL can achieve higher accuracy compared to traditional federated learning techniques, especially in personalized federated learning and other scenarios where customized models are required on resource-constrained devices.
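To see why training one block at a time can lower peak memory, consider a rough accounting model. The sketch below assumes that every block present on the device stores its weights, while only the block being trained also stores gradients, optimizer state, and the activations needed for backpropagation. The block sizes are made up, the cost of output modules is ignored, and blocks after the active one are assumed to be absent; none of this comes from the paper's measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    weights_mb: float      # parameter storage
    activations_mb: float  # activations kept for the backward pass


def training_peak_mb(blocks: List[Block], trainable: List[bool],
                     optimizer_copies: int = 1) -> float:
    """Rough peak memory: every block stores weights; only trainable blocks
    also store gradients, optimizer state, and backward-pass activations."""
    peak = 0.0
    for block, is_trainable in zip(blocks, trainable):
        peak += block.weights_mb
        if is_trainable:
            peak += block.weights_mb                     # gradients
            peak += optimizer_copies * block.weights_mb  # e.g. momentum buffer
            peak += block.activations_mb
    return peak


def progressive_peak_mb(blocks: List[Block], optimizer_copies: int = 1) -> float:
    """Largest per-stage peak when blocks are trained front to back, with
    earlier blocks frozen and later blocks not yet present (assumption)."""
    stage_peaks = []
    for active in range(len(blocks)):
        present = blocks[: active + 1]
        trainable = [i == active for i in range(active + 1)]
        stage_peaks.append(training_peak_mb(present, trainable, optimizer_copies))
    return max(stage_peaks)


if __name__ == "__main__":
    # Hypothetical sizes in MB, chosen only for illustration.
    model = [Block(4, 120), Block(8, 60), Block(16, 30)]
    full = training_peak_mb(model, [True] * len(model))
    progressive = progressive_peak_mb(model)
    print(f"full-model training: {full:.0f} MB")
    print(f"progressive peak:    {progressive:.0f} MB")
```

With these made-up sizes the full-model estimate is 294 MB and the progressive peak is 132 MB. The numbers only illustrate the mechanism and should not be read as a reproduction of the paper's reported 57.4% reduction.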

The authors prove the convergence of ProFL theoretically and evaluate it through extensive experiments on representative models and datasets. The reported results show that ProFL reduces the peak memory footprint by up to 57.4% and improves model accuracy by up to 82.4%.

Critical Analysis

The ProFL paper presents a promising approach to address the memory constraints of on-device training in federated learning. The progressive training strategy is a clever solution: training and freezing one block at a time keeps the training footprint within the device's memory.

One potential limitation of the proposed approach is that blocks are trained one after another, each over its own series of rounds. This may increase communication overhead and total training time compared with updating the full model in every round. The authors acknowledge this trade-off and suggest exploring ways to optimize the communication and convergence efficiency of the progressive training process.

Additionally, the paper does not address the potential impact of device heterogeneity on the final model quality. While ProFL aims to perform well on devices with diverse hardware capabilities, it would be valuable to investigate the model performance and personalization across a wider range of device types and configurations.

Another area for further research could be the integration of ProFL with other federated learning techniques, such as personalized federated learning or federated continual learning, to leverage the benefits of multiple approaches and further enhance the overall performance and adaptability of the federated learning system.

Conclusion

The ProFL paper introduces a progressive federated learning framework, ProFL, that addresses the memory constraints of on-device training. By training the model block by block and freezing each block once it converges, ProFL enables effective learning on devices with limited memory, breaking the so-called "memory wall" that often plagues traditional federated learning techniques.

The key contribution of ProFL is its ability to improve model performance and accuracy, particularly in heterogeneous federated learning scenarios where participating devices have diverse hardware capabilities. This is an important step forward in making federated learning more practical and accessible for a wider range of applications, including personalized models on resource-constrained devices.

While the paper presents promising results, there are opportunities for further research to optimize the communication and convergence efficiency of the progressive training process, as well as explore the integration of ProFL with other federated learning techniques. Overall, the ProFL approach demonstrates the potential to unlock new possibilities in federated learning by overcoming the memory constraints that have long plagued on-device training.
