Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training

2007.13985

YC

0

Reddit

0

Published 4/16/2024 by Shen-Yi Zhao, Chang-Wei Shi, Yin-Peng Xie, Wu-Jun Li

🏋️

Abstract

Stochastic gradient descent~(SGD) and its variants have been the dominating optimization methods in machine learning. Compared to SGD with small-batch training, SGD with large-batch training can better utilize the computational power of current multi-core systems such as graphics processing units~(GPUs) and can reduce the number of communication rounds in distributed training settings. Thus, SGD with large-batch training has attracted considerable attention. However, existing empirical results showed that large-batch training typically leads to a drop in generalization accuracy. Hence, how to guarantee the generalization ability in large-batch training becomes a challenging task. In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum~(SNGM), for large-batch training. We prove that with the same number of gradient computations, SNGM can adopt a larger batch size than momentum SGD~(MSGD), which is one of the most widely used variants of SGD, to converge to an $epsilon$-stationary point. Empirical results on deep learning verify that when adopting the same large batch size, SNGM can achieve better test accuracy than MSGD and other state-of-the-art large-batch training methods.

Create account to get full access

or

If you already have an account, we'll log you in

Overview

  • Stochastic Gradient Descent (SGD) and its variants are widely used optimization methods in machine learning
  • Large-batch SGD can better utilize computational power of modern hardware like GPUs and reduce communication rounds in distributed training
  • However, large-batch training often leads to a drop in generalization accuracy, which is a challenging problem

Plain English Explanation

Machine learning models are often trained using an optimization technique called Stochastic Gradient Descent (SGD). SGD works by gradually adjusting the model's parameters to minimize the training error. One variant of SGD, called large-batch SGD, can take advantage of powerful multi-core hardware like graphics processing units (GPUs) to speed up the training process. Large-batch SGD can also reduce the amount of communication required when training models across multiple machines in a distributed system.

Despite these benefits, researchers have found that large-batch SGD often leads to a decrease in the model's ability to generalize, meaning it may perform worse on new, unseen data compared to models trained with smaller batches. This is a significant challenge, as the ultimate goal of machine learning is to create models that can perform well on real-world data, not just the data used during training.

Technical Explanation

In this paper, the authors propose a new method called Stochastic Normalized Gradient Descent with Momentum (SNGM) to address the generalization issue with large-batch SGD. They prove that SNGM can use a larger batch size than another popular variant called Momentum SGD (MSGD) while still converging to a high-quality solution.

The key idea behind SNGM is to normalize the gradients during the optimization process, which helps the model escape from poor local minima and improve its generalization performance. The authors also incorporate a momentum term, which helps the optimization process converge more quickly.

Through experiments on deep learning tasks, the authors show that when using the same large batch size, SNGM can achieve better test accuracy compared to MSGD and other state-of-the-art large-batch training methods.

Critical Analysis

The authors provide a solid theoretical analysis of SNGM and demonstrate its empirical effectiveness. However, the paper does not explore the limits of SNGM's performance or discuss potential drawbacks or caveats.

For example, the authors do not investigate how SNGM's performance scales with the batch size or the depth of the neural network being trained. It's possible that SNGM may not be as effective for extremely large batch sizes or very deep models.

Additionally, the paper only evaluates SNGM on a limited set of tasks and datasets. More extensive testing on a wider range of problems would help validate the generalizability of the method.

Conclusion

This paper presents a promising new optimization technique called Stochastic Normalized Gradient Descent with Momentum (SNGM) that can enable the use of large batch sizes in training machine learning models without sacrificing generalization performance. By normalizing the gradients and incorporating momentum, SNGM can outperform other large-batch training methods.

While the theoretical and empirical results are compelling, further research is needed to fully understand the limitations and potential of SNGM. Nonetheless, this work contributes an important step towards addressing a significant challenge in the field of machine learning.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The Marginal Value of Momentum for Small Learning Rate SGD

Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li

YC

0

Reddit

0

Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.

Read more

4/17/2024

Random Scaling and Momentum for Non-smooth Non-convex Optimization

Random Scaling and Momentum for Non-smooth Non-convex Optimization

Qinzi Zhang, Ashok Cutkosky

YC

0

Reddit

0

Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at each time point by an exponentially distributed random scalar. The resulting algorithm achieves optimal convergence guarantees. Intriguingly, this result is not derived by a specific analysis of SGDM: instead, it falls naturally out of a more general framework for converting online convex optimization algorithms to non-convex optimization algorithms.

Read more

5/17/2024

Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization

Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization

Naoki Sato, Hideaki Iiduka

YC

0

Reddit

0

For nonconvex objective functions, including deep neural networks, stochastic gradient descent (SGD) with momentum has fast convergence and excellent generalizability, but a theoretical explanation for this is lacking. In contrast to previous studies that defined the stochastic noise that occurs during optimization as the variance of the stochastic gradient, we define it as the gap between the search direction of the optimizer and the steepest descent direction and show that its level dominates generalizability of the model. We also show that the stochastic noise in SGD with momentum smoothes the objective function, the degree of which is determined by the learning rate, the batch size, the momentum factor, the variance of the stochastic gradient, and the upper bound of the gradient norm. By numerically deriving the stochastic noise level in SGD and SGD with momentum, we provide theoretical findings that help explain the training dynamics of SGD with momentum, which were not explained by previous studies on convergence and stability. We also provide experimental results supporting our assertion that model generalizability depends on the stochastic noise level.

Read more

5/29/2024

📶

(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum

Anh Dang, Reza Babanezhad, Sharan Vaswani

YC

0

Reddit

0

Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over stochastic gradient descent. By primarily focusing on strongly-convex quadratics, we aim to better understand the theoretical advantage of SHB and subsequently improve the method. For strongly-convex quadratics, Kidambi et al. (2018) show that SHB (with a mini-batch of size $1$) cannot attain accelerated convergence, and hence has no theoretical benefit over SGD. They conjecture that the practical gain of SHB is a by-product of using larger mini-batches. We first substantiate this claim by showing that SHB can attain an accelerated rate when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $kappa$. Specifically, we prove that with the same step-size and momentum parameters as in the deterministic setting, SHB with a sufficiently large mini-batch size results in an $Oleft(exp(-frac{T}{sqrt{kappa}}) + sigma right)$ convergence, where $T$ is the number of iterations and $sigma^2$ is the variance in the stochastic gradients. We prove a lower-bound which demonstrates that a $kappa$ dependence in $b^*$ is necessary. To ensure convergence to the minimizer, we design a noise-adaptive multi-stage algorithm that results in an $Oleft(expleft(-frac{T}{sqrt{kappa}}right) + frac{sigma}{T}right)$ rate. We also consider the general smooth, strongly-convex setting and propose the first noise-adaptive SHB variant that converges to the minimizer at an $O(exp(-frac{T}{kappa}) + frac{sigma^2}{T})$ rate. We empirically demonstrate the effectiveness of the proposed algorithms.

Read more

6/12/2024