Global Momentum Compression for Sparse Communication in Distributed Learning

1905.12948

Published 4/4/2024 by Chang-Wei Shi, Shen-Yi Zhao, Yin-Peng Xie, Hao Gao, Wu-Jun Li

👨‍🏫

Abstract

With the rapid growth of data, distributed momentum stochastic gradient descent~(DMSGD) has been widely used in distributed learning, especially for training large-scale deep models. Due to the latency and limited bandwidth of the network, communication has become the bottleneck of distributed learning. Communication compression with sparsified gradient, abbreviated as emph{sparse communication}, has been widely employed to reduce communication cost. All existing works about sparse communication in DMSGD employ local momentum, in which the momentum only accumulates stochastic gradients computed by each worker locally. In this paper, we propose a novel method, called emph{underline{g}}lobal emph{underline{m}}omentum emph{underline{c}}ompression~(GMC), for sparse communication. Different from existing works that utilize local momentum, GMC utilizes global momentum. Furthermore, to enhance the convergence performance when using more aggressive sparsification compressors (e.g., RBGS), we extend GMC to GMC+. We theoretically prove the convergence of GMC and GMC+. To the best of our knowledge, this is the first work that introduces global momentum for sparse communication in distributed learning. Empirical results demonstrate that, compared with the local momentum counterparts, our GMC and GMC+ can achieve higher test accuracy and exhibit faster convergence, especially under non-IID data distribution.

Create account to get full access

Overview

As datasets grow larger, distributed learning has become increasingly important, especially for training large-scale deep learning models.
Communication between distributed workers has become a bottleneck due to network latency and limited bandwidth.
Compressing the gradients sent between workers, known as "sparse communication," helps reduce the communication cost.
Existing sparse communication methods in distributed learning use local momentum, where the momentum only accumulates gradients computed by each worker locally.

Plain English Explanation

Training large machine learning models often requires splitting the work across many computers, or "workers," working together in a distributed system. This distributed approach helps speed up the training process. However, the need for these workers to constantly communicate with each other can slow things down, as network delays and bandwidth limitations create a communication bottleneck.

To address this, the researchers propose a new method called "Global Momentum Compression" (GMC) that compresses the information shared between workers. Rather than each worker only remembering its own past gradients (local momentum), GMC has the workers collectively remember a global set of past gradients. This global momentum helps the model converge faster and achieve higher accuracy, especially when the data is unevenly distributed across the workers.

The researchers also introduce an enhanced version called GMC+ that further improves performance when using more aggressive compression techniques. Both GMC and GMC+ come with mathematical proofs showing they will converge properly.

Technical Explanation

The paper proposes two new methods for sparse communication in distributed momentum stochastic gradient descent (DMSGD):

Global Momentum Compression (GMC): This approach utilizes a shared "global momentum" across all worker nodes, rather than the typical "local momentum" used in prior work. The global momentum accumulates gradients computed by all workers, rather than just those computed locally.
Global Momentum Compression+ (GMC+): This extends GMC by further enhancing the convergence when using more aggressive gradient sparsification (compression) techniques, such as RBGS.

The key innovation is the use of global momentum, rather than local momentum as in previous sparse communication methods. The authors provide theoretical convergence guarantees for both GMC and GMC+.

The empirical results show that GMC and GMC+ can achieve higher test accuracy and faster convergence compared to local momentum approaches, especially under non-i.i.d. (unevenly distributed) data conditions.

Critical Analysis

The paper provides a novel and promising approach to addressing the communication bottleneck in distributed learning. The use of global momentum is an interesting idea that seems to offer performance benefits.

However, the analysis is limited to convex optimization problems and fully-connected neural networks. It would be valuable to see how GMC and GMC+ perform on more complex, state-of-the-art deep learning models and datasets.

Additionally, the paper does not discuss the overhead or computational complexity introduced by the global momentum mechanism. This would be an important consideration, as it could potentially offset the benefits of reduced communication.

Further research could also explore the sensitivity of GMC and GMC+ to hyperparameter settings, as well as how they scale to larger numbers of worker nodes.

Conclusion

This paper presents a novel approach called Global Momentum Compression (GMC) that leverages a shared global momentum across distributed workers to improve the efficiency of sparse communication in distributed learning. The authors show that GMC and its enhanced version GMC+ can achieve higher accuracy and faster convergence compared to existing local momentum methods, particularly when the data is unevenly distributed.

The use of global momentum is a promising direction for addressing the communication bottleneck in large-scale distributed learning. While the current analysis is limited, the theoretical guarantees and empirical results suggest that GMC and GMC+ warrant further investigation and development. Continued research in this area could lead to significant advancements in the scalability and efficiency of distributed deep learning systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

👀

Compressed and Sparse Models for Non-Convex Decentralized Learning

Andrew Campbell, Hang Liu, Leah Woldemariam, Anna Scaglione

Recent research highlights frequent model communication as a significant bottleneck to the efficiency of decentralized machine learning (ML), especially for large-scale and over-parameterized neural networks (NNs). To address this, we present Malcom-PSGD, a novel decentralized ML algorithm that combines gradient compression techniques with model sparsification. We promote model sparsity by adding $ell_1$ regularization to the objective and present a decentralized proximal SGD method for training. Our approach employs vector source coding and dithering-based quantization for the compressed gradient communication of sparsified models. Our analysis demonstrates that Malcom-PSGD achieves a convergence rate of $mathcal{O}(1/sqrt{t})$ with respect to the iterations $t$, assuming a constant consensus and learning rate. This result is supported by our proof for the convergence of non-convex compressed Proximal SGD methods. Additionally, we conduct a bit analysis, providing a closed-form expression for the communication costs associated with Malcom-PSGD. Numerical results verify our theoretical findings and demonstrate that our method reduces communication costs by approximately $75%$ when compared to the state-of-the-art.

6/7/2024

cs.LG cs.DC cs.DS cs.MA

🧠

SASG: Sparse Communication with Adaptive Aggregated Stochastic Gradients for Distributed Learning

Xiaoge Deng, Dongsheng Li, Tao Sun, Xicheng Lu

Gradient-based optimization methods implemented on distributed computing architectures are increasingly used to tackle large-scale machine learning applications. A key bottleneck in such distributed systems is the high communication overhead for exchanging information, such as stochastic gradients, between workers. The inherent causes of this bottleneck are the frequent communication rounds and the full model gradient transmission in every round. In this study, we present SASG, a communication-efficient distributed algorithm that enjoys the advantages of sparse communication and adaptive aggregated stochastic gradients. By dynamically determining the workers who need to communicate through an adaptive aggregation rule and sparsifying the transmitted information, the SASG algorithm reduces both the overhead of communication rounds and the number of communication bits in the distributed system. For the theoretical analysis, we introduce an important auxiliary variable and define a new Lyapunov function to prove that the communication-efficient algorithm is convergent. The convergence result is identical to the sublinear rate of stochastic gradient descent, and our result also reveals that SASG scales well with the number of distributed workers. Finally, experiments on training deep neural networks demonstrate that the proposed algorithm can significantly reduce communication overhead compared to previous methods.

6/11/2024

cs.DC

🛠️

Near Optimal Decentralized Optimization with Compression and Momentum Tracking

Rustem Islamov, Yuan Gao, Sebastian U. Stich

Communication efficiency has garnered significant attention as it is considered the main bottleneck for large-scale decentralized Machine Learning applications in distributed and federated settings. In this regime, clients are restricted to transmitting small amounts of quantized information to their neighbors over a communication graph. Numerous endeavors have been made to address this challenging problem by developing algorithms with compressed communication for decentralized non-convex optimization problems. Despite considerable efforts, the current results suffer from various issues such as non-scalability with the number of clients, requirements for large batches, or bounded gradient assumption. In this paper, we introduce MoTEF, a novel approach that integrates communication compression with Momentum Tracking and Error Feedback. Our analysis demonstrates that MoTEF achieves most of the desired properties, and significantly outperforms existing methods under arbitrary data heterogeneity. We provide numerical experiments to validate our theoretical findings and confirm the practical superiority of MoTEF.

5/31/2024

cs.LG cs.AI stat.ML

Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance

Alexander Stollenwerk, Laurent Jacques

We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework. Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas: (i) a one-bit quantization procedure leveraging the technique of dithering, and (ii) a randomized fast Walsh-Hadamard transform to flatten the stochastic gradient before quantization. As a result, the approximation of the true gradient in this scheme is biased, but it prevents commonly encountered algorithmic problems, such as exploding variance in the one-bit compression regime, deterioration of performance in the case of sparse gradients, and restrictive assumptions on the distribution of the stochastic gradients. In fact, we show SGD-like convergence guarantees under mild conditions. The compression technique can be used in both directions of worker-server communication, therefore admitting distributed optimization with full communication compression.

5/21/2024

cs.LG cs.NA