Demystifying Why Local Aggregation Helps: Convergence Analysis of Hierarchical SGD

Read original: arXiv:2010.12998 - Published 4/12/2024 by Jiayi Wang, Shiqiang Wang, Rong-Rong Chen, Mingyue Ji

🌀

Overview

The paper introduces a new distributed Stochastic Gradient Descent (SGD) algorithm called Hierarchical SGD (H-SGD) for multi-level communication networks.
In H-SGD, workers send their updated local models to local servers for aggregation before each global aggregation.
The paper analyzes the effect of local aggregation on the global convergence of H-SGD.

Plain English Explanation

Distributed machine learning is a way to train models using multiple computers working together. In this approach, data is split across different workers, and they each update a local model. These local models are then combined to create a single global model.

The paper introduces a new distributed SGD algorithm called H-SGD that adds an extra step to this process. Before the global aggregation, the local models are first aggregated at the local server level. The researchers then analyze how this additional local aggregation step affects the overall convergence of the global model.

To do this, the paper introduces a new way of looking at the "divergence" - or difference - between the local and global models. It uses this concept of upward and downward divergence to derive a theoretical bound on the convergence of H-SGD.

The key insight is that the convergence of H-SGD is "sandwiched" between the convergence of two simpler settings: one where there is only local aggregation, and one where there is only global aggregation. This suggests that the local aggregation step in H-SGD can actually help improve the overall convergence of the global model.

Technical Explanation

The paper first introduces the H-SGD algorithm, which performs local aggregation at intermediate servers before the final global aggregation. To analyze its convergence, the authors define new notions of "upward" and "downward" divergences to capture the difference between local and global models.

Using these divergence measures, the paper provides a worst-case convergence upper bound for two-level H-SGD, considering non-IID data, non-convex objectives, and stochastic gradients. This bound is shown to be "sandwiched" between the upper bounds of two single-level SGD settings - one with only local updates, and one with only global updates.

The authors then extend this analysis to the general case of H-SGD with more than two levels, and again observe the same sandwich behavior in the convergence upper bounds. This suggests that the local aggregation step in H-SGD can actually help improve the overall convergence compared to a purely global aggregation approach.

The theoretical analysis provides insights into why local aggregation can be beneficial for the convergence of H-SGD, which the authors argue is an important step towards a better understanding of this new distributed optimization algorithm.

Critical Analysis

The paper provides a rigorous theoretical analysis of the convergence properties of H-SGD, which is a novel contribution to the field of distributed machine learning. The authors' introduction of upward and downward divergence measures is a clever way to capture the interplay between local and global model updates.

However, the analysis is limited to a worst-case convergence upper bound, and does not provide guarantees on the actual convergence rate or the tightness of the bound. Additionally, the theoretical results assume certain conditions, such as bounded gradients and Lipschitz-continuous objective functions, which may not always hold in practice.

Furthermore, the paper does not explore the practical performance of H-SGD compared to other distributed SGD algorithms, such as Decentralized SGD or AdaGossip. Empirical evaluations on real-world datasets and applications would help validate the theoretical insights and demonstrate the potential benefits of H-SGD in realistic settings.

Future research could also investigate the impact of different local aggregation strategies, the effect of network topology, and the potential connections to other hierarchical optimization techniques, such as multi-level graph subspace contrastive learning or iterative local graph generation. Such extensions could further strengthen the understanding and applicability of H-SGD.

Conclusion

The paper introduces a new distributed SGD algorithm called Hierarchical SGD (H-SGD) and provides a novel theoretical analysis of its convergence properties. The key insight is that the local aggregation step in H-SGD can actually help improve the overall convergence of the global model, as evidenced by the "sandwich" behavior observed in the convergence upper bounds.

These theoretical results offer important insights into the role of local aggregation in distributed optimization and could inform the design of more effective distributed learning algorithms. While the analysis has certain limitations, the paper represents a valuable contribution to the understanding of hierarchical communication structures in the context of large-scale machine learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🌀

Demystifying Why Local Aggregation Helps: Convergence Analysis of Hierarchical SGD

Jiayi Wang, Shiqiang Wang, Rong-Rong Chen, Mingyue Ji

Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregations. Despite recent research efforts, the effect of local aggregation on global convergence still lacks theoretical understanding. In this work, we first introduce a new notion of upward and downward divergences. We then use it to conduct a novel analysis to obtain a worst-case convergence upper bound for two-level H-SGD with non-IID data, non-convex objective function, and stochastic gradient. By extending this result to the case with random grouping, we observe that this convergence upper bound of H-SGD is between the upper bounds of two single-level local SGD settings, with the number of local iterations equal to the local and global update periods in H-SGD, respectively. We refer to this as the sandwich behavior. Furthermore, we extend our analytical approach based on upward and downward divergences to study the convergence for the general case of H-SGD with more than two levels, where the sandwich behavior still holds. Our theoretical results provide key insights of why local aggregation can be beneficial in improving the convergence of H-SGD.

4/12/2024

The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication

Kumar Kshitij Patel, Margalit Glasgow, Ali Zindari, Lingxiao Wang, Sebastian U. Stich, Ziheng Cheng, Nirmit Joshi, Nathan Srebro

Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing that these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which fully resolves our understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. Towards this end, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low.

5/21/2024

🛠️

Convergence of Distributed Adaptive Optimization with Local Updates

Ziheng Cheng, Margalit Glasgow

We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in terms of reducing communication complexity, have not been fully understood yet. In this paper, we prove that em Local SGD em with momentum (em Local em SGDM) and em Local em Adam can outperform their minibatch counterparts in convex and weakly convex settings, respectively. Our analysis relies on a novel technique to prove contraction during local iterations, which is a crucial but challenging step to show the advantages of local updates, under generalized smoothness assumption and gradient clipping.

9/23/2024

Asynchronous Local-SGD Training for Language Modeling

Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato

Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of {it asynchronous} Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation by examining how worker hardware heterogeneity, model size, number of workers, and optimizer could impact the learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We propose a novel method that utilizes a delayed Nesterov momentum update and adjusts the workers' local training steps based on their computation speed. This approach, evaluated with models up to 150M parameters on the C4 dataset, matches the performance of synchronous Local-SGD in terms of perplexity per update step, and significantly surpasses it in terms of wall clock time.

9/24/2024