Improving Implicit Regularization of SGD with Preconditioning for Least Square Problems

Read original: arXiv:2403.08585 - Published 5/28/2024 by Junwei Su, Difan Zou, Chuan Wu

🚀

Introduction

The provided text discusses the role of Stochastic Gradient Descent (SGD) in deep learning and its ability to provide implicit regularization, even in the absence of explicit regularizers. The text highlights a close relationship between the implicit regularization of SGD and the explicit L2-type regularization (ridge regression). It notes that while SGD can outperform ridge regression in certain problem instances, it can also underperform in others, which is attributed to an inherent imbalance in the optimization within the data covariance matrix.

The central question addressed in the study is whether preconditioning techniques can enhance the generalization performance of SGD and close the gap relative to ridge regression. The paper makes the following key contributions:

It extends the existing results characterizing the excessive risk (generalization performance) of SGD and ridge regression to include scenarios with preconditioning, providing insights into how the preconditioning matrix influences the excessive risk.
It proposes a simple preconditioning matrix tailored to SGD, which leverages information from the data covariance matrix. It is shown that under the theoretical setting, SGD with the proposed preconditioning can consistently outperform standard ridge regression and a representative family of preconditioned ridge regression.
Even in practical scenarios where precise information about the data covariance matrix is unavailable, the study illustrates that the proposed preconditioning design allows for robust estimation using a finite set of inexpensive unlabeled data. In this case, SGD can still consistently outperform the excessive risk of standard and preconditioned ridge regression.

The paper aims to enhance the understanding of the generalization performance of SGD and the potential benefits of preconditioning in improving its performance relative to ridge regression.

Related Work

This section provides an overview of the technical advancements in the analysis of ridge regression and stochastic gradient descent (SGD), as well as related research comparing SGD to explicit norm-based regularization and ridge regression. The section also discusses the current usage of preconditioning in both SGD and ridge regression.

Ridge regression holds a central position in machine learning. There is a comprehensive understanding of the excess risk bounds of ridge regression in the underparameterized scenario, but more work is needed to characterize the excess risk in the overparameterized regime. Recent advancements have provided detailed non-asymptotic risk bounds for ordinary least squares and ridge regression in overparameterized contexts.

In the finite-dimensional setting, risk bounds for one-pass, constant-stepsize SGD have been well-established. In the underparameterized setting, constant-stepsize SGD with tail-averaging has been recognized for achieving the minimax optimal rate for least squares. A recent study has expanded on the previous analysis, offering a detailed risk bound for SGD in the overparameterized regime.

It is well known that in least squares problems, multi-pass SGD approaches the solution with the minimum norm, representing a recognized implicit bias of SGD. However, this norm-based regularization alone does not fully explain the optimization behavior of SGD in more extensive scenarios. Recent work has presented a comparative analysis between SGD and ridge regression from the perspective of excess risk, demonstrating that SGD can outperform ridge regression in some scenarios but fall short in others.

Preconditioning is a widely employed optimization technique aimed at enhancing the efficiency and effectiveness of optimization algorithms, including SGD and ridge regression. However, the impact of preconditioning on the excess risk of SGD and ridge regression, and whether a preconditioning matrix can enable SGD to consistently outperform ridge regression, remain open questions that motivate the research.

Problem Setup and Preliminaries

The paper explores the usage of precondition in stochastic gradient descent (SGD) and compares the generalization performance of preconditioned SGD and ridge regression when solving least squares problems in the overparameterized regime. The authors make several assumptions about the data distribution, including that the covariance matrix is strictly positive definite and has a finite trace, the fourth-order moment is bounded, and the noise-data covariance is bounded.

The paper introduces preconditioned SGD, where the update rule includes a preconditioned matrix G that differentiates it from standard SGD. The final estimator for preconditioned SGD is computed as a tail-averaged of the iterates. The paper also introduces preconditioned ridge regression, where the preconditioned matrix M distinguishes it from standard ridge regression.

The goal of the paper is to demonstrate the existence of a preconditioning matrix that enables SGD to achieve comparable or superior excessive risk to ridge regression when provided with an equal amount of training data. The authors focus on a setting where both SGD and ridge regression are considered "generalizable", meaning they achieve a vanishing excess risk as the number of training samples goes to infinity.

Main Results

The main results of this paper are summarized as follows:

The authors provide an excessive risk characterization of SGD and ridge regression with preconditioning. They establish an excessive risk upper bound for preconditioned SGD and an excessive risk lower bound for preconditioned ridge regression.
The authors propose a design for the preconditioning matrix that aims to amplify and flatten the relative signal strength of the leading eigenspace, in order to improve the performance of SGD compared to ridge regression.
Under a theoretical setting with access to the true covariance matrix, the authors show that preconditioned SGD can consistently outperform standard ridge regression, as well as ridge regression with a representative family of preconditioning matrices.
The authors also address practical scenarios where the exact covariance matrix is unavailable. They demonstrate that their proposed preconditioning matrix design allows for robust estimation from finite unlabeled data, and preconditioned SGD still maintains a theoretical advantage over ridge regression in this setting.
The empirical study supports the proposed theoretical findings, showing that preconditioning can indeed improve the generalization performance of SGD, and preconditioned SGD can outperform ridge regression, even when the preconditioning matrix is estimated from unlabeled data.

Conclusion

This paper explores implicit regularization in stochastic gradient descent (SGD) with preconditioning. The analysis extends the understanding of excessive risk associated with both SGD and ridge regression under preconditioning. The results show that preconditioned SGD consistently outperforms ridge regression with and without preconditioning, bridging the performance gap between the two methods. The proposed preconditioned matrix can be robustly estimated using readily available, inexpensive, and unlabeled data, making it practical and applicable in real-world settings. The findings establish that SGD with preconditioning achieves better generalization performance compared to ridge regression, highlighting the potential for further improvement of SGD through preconditioning techniques. Future research will explore whether preconditioning can enhance the implicit regularization of SGD in other linear and potentially nonlinear models.

Appendix A Excessive Risk Analysis of Preconditioned SGD and Ridge

This paper extends the proof techniques and results from previous work to sharply characterize the excess risk bound for stochastic gradient descent (SGD) with preconditioned data and ridge regression with preconditioned data.

For preconditioned SGD, the authors derive the update rules for the bias and variance errors of the preconditioned iterates. They show that the excess risk of preconditioned SGD can be analyzed by studying the excess risk of standard SGD on a transformed data covariance matrix, ground truth vector, and iterates.

For preconditioned ridge regression, the authors extend previous results on the bias and variance of standard ridge regression to the preconditioned case. They show that the preconditioned ridge regression problem is equivalent to a standard ridge regression problem on a transformed data and ground truth. This allows them to directly apply the standard ridge regression results to the preconditioned case.

The main technical contributions are the derivation of the update rules for preconditioned SGD and the reduction of preconditioned ridge regression to standard ridge regression. The results provide sharp excess risk bounds for these preconditioned optimization algorithms.

Appendix B Analysis of precondition SGD with standard ridge regression

The key points from the provided text are:

The authors present a proof for Theorem 4.3, which shows that the excessive risk of preconditioned SGD is bounded by the excessive risk of ridge regression.
The proof involves several steps:

Lemma B.1 shows that preconditioning does not change the total signal strength.
Lemma B.2 characterizes the updated eigenvalues of the transformed data covariance matrix. It shows the eigenvectors are preserved, and provides a mapping between the eigenvalues.
Lemma B.3 demonstrates the mapping between the eigenvalues is monotonic, preserving the order of eigenvectors and eigenvalues.
The proof then decomposes the SGD bias and variance into two cases based on the relationship between the ridge regularization parameter λ̂ and the trace of the transformed data covariance matrix.
For each case, the authors show the SGD bias and variance can be bounded by the corresponding quantities for ridge regression.
Combining the bias and variance bounds, the authors conclude there exist choices of the preconditioning parameters η and β such that the excessive risk of preconditioned SGD is bounded by the excessive risk of ridge regression.

Appendix C Analysis of preconditioned SGD with preconditioned ridge regression

This section compares the excessive risk of preconditioned SGD and preconditioned ridge regression. The key findings are:

The preconditioning matrix does not affect the overall signal strength in the excessive risk of preconditioned ridge regression.
The transformed data covariance matrix 𝐇~ and 𝐇^ share the same eigenvectors as the original data covariance matrix 𝐇, with tractable eigenvalues.
For a given preconditioning matrix 𝐌, the proposed preconditioned matrix 𝐆 for SGD can outperform the excessive risk of preconditioned ridge regression. Specifically, the bias and variance of preconditioned SGD are shown to be bounded by those of preconditioned ridge regression, under appropriate settings of the learning rate 𝜂.

In summary, the analysis demonstrates that the proposed preconditioning method for SGD can achieve a lower excessive risk compared to preconditioned ridge regression.

Appendix D Analysis of precondition SGD with empirical estimation

In this section, the authors consider a practical scenario where the precise covariance matrix H is not available, but instead a set of unlabeled data points are accessible that are sampled from the same distribution as the training data. The authors propose estimating a preconditioning matrix G using this unlabeled data, and then analyze the transformed covariance matrix H̃ = G1/2HG1/2.

The key results are:

The authors provide a lower bound on the k-th largest eigenvalue of H̃ that holds with high probability.
The authors show that the trace of G̃H is upper bounded by (1 + C) times the trace of H, where C is some constant.
Building on these results, the authors are able to show that preconditioned SGD can consistently outperform preconditioned ridge regression in terms of both bias and variance. Specifically, the excessive risk of preconditioned SGD is shown to be bounded by that of preconditioned ridge regression.

The analysis relies on several technical lemmas, including results on the relationship between the eigenvalues of the transformed covariance matrix H̃ and the original covariance matrix H.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🚀

Improving Implicit Regularization of SGD with Preconditioning for Least Square Problems

Junwei Su, Difan Zou, Chuan Wu

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice and plays an important role in the generalization of modern machine learning. However, prior research has revealed instances where the generalization performance of SGD is worse than ridge regression due to uneven optimization along different dimensions. Preconditioning offers a natural solution to this issue by rebalancing optimization across different directions. Yet, the extent to which preconditioning can enhance the generalization performance of SGD and whether it can bridge the existing gap with ridge regression remains uncertain. In this paper, we study the generalization performance of SGD with preconditioning for the least squared problem. We make a comprehensive comparison between preconditioned SGD and (standard & preconditioned) ridge regression. Our study makes several key contributions toward understanding and improving SGD with preconditioning. First, we establish excess risk bounds (generalization performance) for preconditioned SGD and ridge regression under an arbitrary preconditions matrix. Second, leveraging the excessive risk characterization of preconditioned SGD and ridge regression, we show that (through construction) there exists a simple preconditioned matrix that can make SGD comparable to (standard & preconditioned) ridge regression. Finally, we show that our proposed preconditioning matrix is straightforward enough to allow robust estimation from finite samples while maintaining a theoretical improvement. Our empirical results align with our theoretical findings, collectively showcasing the enhanced regularization effect of preconditioned SGD.

5/28/2024

↗️

Robust, randomized preconditioning for kernel ridge regression

Mateo D'iaz, Ethan N. Epperly, Zachary Frangella, Joel A. Tropp, Robert J. Webber

This paper investigates two randomized preconditioning techniques for solving kernel ridge regression (KRR) problems with a medium to large number of data points ($10^4 leq N leq 10^7$), and it introduces two new methods with state-of-the-art performance. The first method, RPCholesky preconditioning, accurately solves the full-data KRR problem in $O(N^2)$ arithmetic operations, assuming sufficiently rapid polynomial decay of the kernel matrix eigenvalues. The second method, KRILL preconditioning, offers an accurate solution to a restricted version of the KRR problem involving $k ll N$ selected data centers at a cost of $O((N + k^2) k log k)$ operations. The proposed methods solve a broad range of KRR problems, making them ideal for practical applications.

7/12/2024

Graph Neural Preconditioners for Iterative Solutions of Sparse Linear Systems

Jie Chen

Preconditioning is at the heart of iterative solutions of large, sparse linear systems of equations in scientific disciplines. Several algebraic approaches, which access no information beyond the matrix itself, are widely studied and used, but ill-conditioned matrices remain very challenging. We take a machine learning approach and propose using graph neural networks as a general-purpose preconditioner. They show attractive performance for ill-conditioned problems, in part because they better approximate the matrix inverse from appropriately generated training data. Empirical evaluation on over 800 matrices suggests that the construction time of these graph neural preconditioners (GNPs) is more predictable than other widely used ones, such as ILU and AMG, while the execution time is faster than using a Krylov method as the preconditioner, such as in inner-outer GMRES. GNPs have a strong potential for solving large-scale, challenging algebraic problems arising from not only partial differential equations, but also economics, statistics, graph, and optimization, to name a few.

6/4/2024

An operator preconditioning perspective on training in physics-informed machine learning

Tim De Ryck, Florent Bonnet, Siddhartha Mishra, Emmanuel de B'ezenac

In this paper, we investigate the behavior of gradient descent algorithms in physics-informed machine learning methods like PINNs, which minimize residuals connected to partial differential equations (PDEs). Our key result is that the difficulty in training these models is closely related to the conditioning of a specific differential operator. This operator, in turn, is associated to the Hermitian square of the differential operator of the underlying PDE. If this operator is ill-conditioned, it results in slow or infeasible training. Therefore, preconditioning this operator is crucial. We employ both rigorous mathematical analysis and empirical evaluations to investigate various strategies, explaining how they better condition this critical operator, and consequently improve training.

5/6/2024