An Improved Empirical Fisher Approximation for Natural Gradient Descent

2406.06420

Published 6/11/2024 by Xiaodong Wu, Wenyi Yu, Chao Zhang, Philip Woodland

An Improved Empirical Fisher Approximation for Natural Gradient Descent

Abstract

Approximate Natural Gradient Descent (NGD) methods are an important family of optimisers for deep learning models, which use approximate Fisher information matrices to pre-condition gradients during training. The empirical Fisher (EF) method approximates the Fisher information matrix empirically by reusing the per-sample gradients collected during back-propagation. Despite its ease of implementation, the EF approximation has its theoretical and practical limitations. This paper first investigates the inversely-scaled projection issue of EF, which is shown to be a major cause of the poor empirical approximation quality. An improved empirical Fisher (iEF) method, motivated as a generalised NGD method from a loss reduction perspective, is proposed to address this issue, meanwhile retaining the practical convenience of EF. The exact iEF and EF methods are experimentally evaluated using practical deep learning setups, including widely-used setups for parameter-efficient fine-tuning of pre-trained models (T5-base with LoRA and Prompt-Tuning on GLUE tasks, and ViT with LoRA for CIFAR100). Optimisation experiments show that applying exact iEF as an optimiser provides strong convergence and generalisation. It achieves the best test performance and the lowest training loss for majority of the tasks, even when compared with well-tuned AdamW/Adafactor baselines. Additionally, under a novel empirical evaluation framework, the proposed iEF method shows consistently better approximation quality to the exact Natural Gradient updates than both EF and the more expensive sampled Fisher (SF). Further investigation also shows that the superior approximation quality of iEF is robust to damping across tasks and training stages. Improving existing approximate NGD optimisers with iEF is expected to lead to better convergence ability and stronger robustness to choice of damping.

Create account to get full access

Overview

This paper proposes an improved empirical Fisher approximation for natural gradient descent, a powerful optimization method used in deep learning.
The authors aim to address the high computational cost and memory requirements of exact natural gradient descent by developing a more efficient approximation.
The proposed method, called Improved Empirical Fisher Approximation (IEFA), outperforms existing approximations in terms of performance and scalability.

Plain English Explanation

Natural gradient descent is a powerful optimization technique used in deep learning, but it can be computationally expensive and memory-intensive. The Improved Empirical Fisher Approximation (IEFA) proposed in this paper aims to address these issues by providing a more efficient approximation of the natural gradient.

The natural gradient is a way of adjusting the gradient of a model's parameters to take into account the geometry of the parameter space. This can lead to faster convergence and better performance, but calculating the exact natural gradient can be challenging, especially for large-scale deep learning models.

The IEFA method uses a data-driven approach to estimate the natural gradient, without needing to compute expensive matrix inversions or store large covariance matrices. This makes it more scalable and efficient than previous approximations, such as Exact Gauss-Newton Optimization and FAdaM.

The authors demonstrate that the IEFA method achieves comparable or better performance than these existing approaches, while being more computationally efficient and requiring less memory. This could make natural gradient descent a more practical and accessible optimization technique for a wider range of deep learning applications.

Technical Explanation

The key idea behind the Improved Empirical Fisher Approximation (IEFA) is to estimate the natural gradient using a data-driven approach, without the need for expensive matrix inversions or large covariance matrix calculations.

The authors build on the concept of Empirical Fisher Approximation (EFA), which uses the outer product of gradients to approximate the Fisher information matrix. However, the original EFA method has some limitations, such as poor scaling with large models and sensitivity to the choice of hyperparameters.

To address these issues, the IEFA method introduces several key improvements:

Adaptive Scaling: The authors propose an adaptive scaling mechanism that adjusts the EFA approximation based on the curvature of the loss function, as captured by the Approximate Fisher Information Matrix (AFIM). This helps to better balance the contributions of different parameter groups during optimization.
Unbiased Estimation: The IEFA method uses an unbiased estimator of the natural gradient, which eliminates the bias introduced by the original EFA approach.
Improved Stability: The authors introduce additional techniques, such as damping and regularization, to improve the numerical stability and robustness of the IEFA method.

Through extensive experiments on various deep learning tasks, the authors demonstrate that the IEFA method outperforms existing natural gradient approximations in terms of both performance and scalability. It achieves comparable or better results than exact natural gradient descent, while being more computationally efficient and memory-friendly.

Critical Analysis

The authors have made a significant contribution to improving the efficiency and practicality of natural gradient descent, a powerful optimization technique in deep learning. The Improved Empirical Fisher Approximation (IEFA) method addresses several key limitations of previous approximations, such as high computational cost and sensitivity to hyperparameters.

However, the paper does not address some potential limitations and areas for further research:

Sensitivity to Initialization: The performance of natural gradient descent methods, including IEFA, can be sensitive to the initialization of model parameters. The authors could explore techniques to improve the robustness of IEFA to different initializations.
Convergence Guarantees: The paper does not provide theoretical guarantees on the convergence properties of the IEFA method. Analyzing the convergence behavior, especially in comparison to exact natural gradient descent, could be a valuable direction for future research.
Generalization to Diverse Architectures: The experiments in the paper focus on fully-connected and convolutional neural networks. It would be interesting to see how the IEFA method performs on a wider range of deep learning architectures, such as recurrent neural networks or transformers.
Practical Considerations: The authors could provide more guidance on the selection of hyperparameters for the IEFA method, as well as insights into the computational and memory requirements in practical deployments.

Overall, the Improved Empirical Fisher Approximation (IEFA) is a promising advancement in the field of natural gradient optimization, and the authors have made a valuable contribution. Further research to address the limitations mentioned above could help to solidify the practicality and robustness of this approach.

Conclusion

The paper presents the Improved Empirical Fisher Approximation (IEFA), a novel method for approximating the natural gradient in deep learning optimization. By addressing the limitations of previous approaches, the IEFA method achieves better performance and scalability, making natural gradient descent a more accessible and practical optimization technique for a wider range of deep learning applications.

The key insights and contributions of this paper include:

A data-driven approach to estimating the natural gradient, without the need for expensive matrix inversions or large covariance matrix calculations.
An adaptive scaling mechanism that adjusts the approximation based on the curvature of the loss function.
An unbiased estimator of the natural gradient, which eliminates the bias introduced by the original Empirical Fisher Approximation (EFA) method.
Techniques to improve the numerical stability and robustness of the IEFA method.

The IEFA method outperforms existing natural gradient approximations in terms of both performance and computational efficiency, demonstrating its potential to make natural gradient descent a more practical and widely-adopted optimization technique in deep learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Inverse-Free Fast Natural Gradient Descent Method for Deep Learning

Xinwei Ou, Ce Zhu, Xiaolin Huang, Yipeng Liu

Second-order optimization techniques have the potential to achieve faster convergence rates compared to first-order methods through the incorporation of second-order derivatives or statistics. However, their utilization in deep learning is limited due to their computational inefficiency. Various approaches have been proposed to address this issue, primarily centered on minimizing the size of the matrix to be inverted. Nevertheless, the necessity of performing the inverse operation iteratively persists. In this work, we present a fast natural gradient descent (FNGD) method that only requires inversion during the first epoch. Specifically, it is revealed that natural gradient descent (NGD) is essentially a weighted sum of per-sample gradients. Our novel approach further proposes to share these weighted coefficients across epochs without affecting empirical performance. Consequently, FNGD exhibits similarities to the average sum in first-order methods, leading to the computational complexity of FNGD being comparable to that of first-order methods. Extensive experiments on image classification and machine translation tasks demonstrate the efficiency of the proposed FNGD. For training ResNet-18 on CIFAR-100, FNGD can achieve a speedup of 2.07$times$ compared with KFAC. For training Transformer on Multi30K, FNGD outperforms AdamW by 24 BLEU score while requiring almost the same training time.

4/30/2024

cs.LG cs.CV

🛠️

Exact Gauss-Newton Optimization for Training Deep Neural Networks

Mikalai Korbit, Adeyemi D. Adeoye, Alberto Bemporad, Mario Zanon

We present EGN, a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix which has the size of the mini-batch. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges to an $epsilon$-stationary point at a linear rate. Finally, our numerical experiments demonstrate that EGN consistently exceeds, or at most matches the generalization performance of well-tuned SGD, Adam, and SGN optimizers across various supervised and reinforcement learning tasks.

5/24/2024

cs.LG stat.ML

🌿

FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information

Dongseong Hwang

This paper establishes a mathematical foundation for the Adam optimizer, elucidating its connection to natural gradient descent through Riemannian and information geometry. We rigorously analyze the diagonal empirical Fisher information matrix (FIM) in Adam, clarifying all detailed approximations and advocating for the use of log probability functions as loss, which should be based on discrete distributions, due to the limitations of empirical FIM. Our analysis uncovers flaws in the original Adam algorithm, leading to proposed corrections such as enhanced momentum calculations, adjusted bias corrections, adaptive epsilon, and gradient clipping. We refine the weight decay term based on our theoretical framework. Our modified algorithm, Fisher Adam (FAdam), demonstrates superior performance across diverse domains including LLM, ASR, and VQ-VAE, achieving state-of-the-art results in ASR.

6/4/2024

cs.LG cs.AI cs.IT

AdaFisher: Adaptive Second Order Optimization via Fisher Information

Damien Martins Gomes, Yanlei Zhang, Eugene Belilovsky, Guy Wolf, Mahdi S. Hosseini

First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by employing the diagonal matrix preconditioning of the stochastic gradient during the training. Despite their widespread, second-order optimization algorithms exhibit superior convergence properties compared to their first-order counterparts e.g. Adam and SGD. However, their practicality in training DNNs are still limited due to increased per-iteration computations and suboptimal accuracy compared to the first order methods. We present AdaFisher--an adaptive second-order optimizer that leverages a block-diagonal approximation to the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification, language modelling and stand out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code available from href{https://github.com/AtlasAnalyticsLab/AdaFisher}{https://github.com/AtlasAnalyticsLab/AdaFisher}

5/28/2024

cs.LG