A generalized neural tangent kernel for surrogate gradient learning

arXiv:2405.15539

Published 5/27/2024 by Luke Eilers, Raoul-Martin Memmesheimer, Sven Goedeke

Abstract

State-of-the-art neural network training methods depend on the gradient of the network function. Therefore, they cannot be applied to networks whose activation functions do not have useful derivatives, such as binary and discrete-time spiking neural networks. To overcome this problem, the activation function's derivative is commonly substituted with a surrogate derivative, giving rise to surrogate gradient learning (SGL). This method works well in practice but lacks theoretical foundation. The neural tangent kernel (NTK) has proven successful in the analysis of gradient descent. Here, we provide a generalization of the NTK, which we call the surrogate gradient NTK, that enables the analysis of SGL. First, we study a naive extension of the NTK to activation functions with jumps, demonstrating that gradient descent for such activation functions is also ill-posed in the infinite-width limit. To address this problem, we generalize the NTK to gradient descent with surrogate derivatives, i.e., SGL. We carefully define this generalization and expand the existing key theorems on the NTK with mathematical rigor. Further, we illustrate our findings with numerical experiments. Finally, we numerically compare SGL in networks with sign activation function and finite width to kernel regression with the surrogate gradient NTK; the results confirm that the surrogate gradient NTK provides a good characterization of SGL.

Overview

Current neural network training methods rely on the gradient of the network function, so they cannot be applied to networks whose activation functions lack useful derivatives, such as binary and discrete-time spiking neural networks.
To address this issue, a technique called surrogate gradient learning (SGL) is used, where the activation function's derivative is substituted with a surrogate derivative.
This paper provides a generalization of the neural tangent kernel (NTK), called the surrogate gradient NTK, to enable the analysis of SGL.

Plain English Explanation

Neural networks are a type of machine learning model inspired by the structure of the human brain. They are trained by adjusting the strengths of the connections between their artificial neurons, using a process called gradient descent. This process relies on being able to calculate the derivative (or slope) of each neuron's activation function, which describes how the output of a neuron changes in response to its input.

However, some types of neural networks, such as binary networks and spiking neural networks, have activation functions that don't have useful derivatives. This means that the standard gradient descent training method can't be applied to them. To get around this problem, researchers have developed a technique called surrogate gradient learning (SGL), where they substitute the true derivative of the activation function with an approximate, or "surrogate," derivative.
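
To make the substitution concrete, here is a minimal sketch of SGL for a single neuron with a sign activation. The forward pass uses the true sign function, while the backward pass uses a surrogate derivative. The choice of surrogate (the derivative of tanh), the toy data, the learning rate, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import math

def sign(z):
    """Binary activation: +1 for z >= 0, -1 otherwise."""
    return 1.0 if z >= 0 else -1.0

def surrogate_derivative(z, beta=1.0):
    """Derivative of tanh(beta * z), used in place of sign'(z), which is 0 almost everywhere.

    Assumption: tanh is only one possible surrogate; beta controls its steepness.
    """
    return beta * (1.0 - math.tanh(beta * z) ** 2)

def sgl_step(w, data, lr=0.1):
    """One full-batch surrogate-gradient step for y = sign(w . x) with loss 0.5 * (y - t)^2."""
    grad = [0.0] * len(w)
    for x, target in data:
        z = sum(wi * xi for wi, xi in zip(w, x))
        error = sign(z) - target                  # forward pass uses the true sign
        pseudo = error * surrogate_derivative(z)  # backward pass uses the surrogate
        for i, xi in enumerate(x):
            grad[i] += pseudo * xi
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# Hypothetical toy data: the label is the sign of the first coordinate.
data = [((1.0, 0.5), 1.0), ((0.8, -0.3), 1.0),
        ((-1.0, 0.2), -1.0), ((-0.7, -0.6), -1.0)]

w = [-0.5, 0.5]  # initial weights that misclassify every point
for _ in range(50):
    w = sgl_step(w, data)

predictions = [sign(sum(wi * xi for wi, xi in zip(w, x))) for x, _ in data]
print(predictions)
```

With the true derivative of sign, every update in this sketch would be zero and the weights would never move; the surrogate is what makes the update informative.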

This paper takes a closer look at SGL and provides a new mathematical framework, called the surrogate gradient NTK, to analyze how it works. The neural tangent kernel (NTK) is a powerful tool for understanding how gradient descent behaves in neural networks, and the authors show how to generalize it to handle the surrogate derivatives used in SGL.

The paper also includes some numerical experiments that demonstrate how the surrogate gradient NTK can be used to characterize the behavior of SGL, especially in networks with binary activation functions. This helps to provide a more solid theoretical foundation for this important technique in machine learning.

Technical Explanation

The authors first study a "naive" extension of the NTK to activation functions with jump discontinuities, such as the sign function used in binary neural networks. They show that gradient descent for such activation functions is ill-posed not only at finite width but also in the infinite-width limit, so taking the network width to infinity does not by itself yield well-defined training dynamics.
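
One way to build intuition for why jumps cause trouble (an illustration only, not the paper's argument) is to approximate the sign function by a smooth function and watch its derivative as the approximation sharpens. The sketch below uses tanh(k * z) as the smooth approximation, which is an assumption chosen for simplicity.

```python
import math

def smooth_sign_derivative(z, k):
    """Derivative of tanh(k * z), a smooth approximation of sign(z) with steepness k."""
    return k * (1.0 - math.tanh(k * z) ** 2)

# As k grows, the derivative grows without bound at the jump (z = 0)
# and shrinks towards zero everywhere else.
for k in (1, 10, 100, 1000):
    at_jump = smooth_sign_derivative(0.0, k)
    away = smooth_sign_derivative(0.5, k)
    print(f"k={k:5d}  derivative at 0: {at_jump:8.1f}  at 0.5: {away:.3e}")
```

The derivative at the jump equals k, so it diverges as the approximation approaches the sign function, while away from the jump it vanishes. Any quantity built from these derivatives inherits that degenerate behavior, which is the kind of problem a surrogate derivative is meant to avoid.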

To address this issue, the authors then generalize the NTK to gradient descent with surrogate derivatives, i.e., SGL. They carefully define this generalization and extend the existing key theorems on the NTK to it with mathematical rigor.
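
For orientation, the standard objects being generalized can be written as follows. The notation (f, \theta, \ell, \eta, \tilde\sigma', \tilde g) is chosen here for illustration and may differ from the paper's; the last line only restates what SGL does to the parameter update and is not the paper's definition of the surrogate gradient NTK.

```latex
% Standard (empirical) NTK of a scalar-valued network f(x; \theta):
\Theta_t(x, x') = \big\langle \nabla_\theta f(x;\theta_t),\; \nabla_\theta f(x';\theta_t) \big\rangle

% Gradient flow on the loss \sum_i \ell(f(x_i;\theta), y_i) with learning rate \eta,
% and the resulting evolution of the network output:
\frac{d\theta_t}{dt} = -\eta \sum_i \nabla_\theta f(x_i;\theta_t)\,
    \frac{\partial \ell}{\partial f}\big(f(x_i;\theta_t), y_i\big)
\quad\Longrightarrow\quad
\frac{d}{dt} f(x;\theta_t) = -\eta \sum_i \Theta_t(x, x_i)\,
    \frac{\partial \ell}{\partial f}\big(f(x_i;\theta_t), y_i\big)

% SGL: backpropagation uses a surrogate \tilde\sigma' in place of \sigma',
% which yields a quasi-gradient \tilde g instead of \nabla_\theta f:
\frac{d\theta_t}{dt} = -\eta \sum_i \tilde g(x_i;\theta_t)\,
    \frac{\partial \ell}{\partial f}\big(f(x_i;\theta_t), y_i\big)
```

The surrogate gradient NTK is the kernel that takes over the role of \Theta for these modified dynamics; its precise definition and the accompanying theorems are given in the paper.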

The authors also present numerical experiments that compare SGL in finite-width networks with sign activation functions to kernel regression using the surrogate gradient NTK. The results confirm that the surrogate gradient NTK provides a good characterization of the behavior of SGL in these types of networks.
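
The kernel-regression side of such a comparison can be sketched as follows. This example builds a kernel from surrogate quasi-gradients of a finite-width, one-hidden-layer sign network and uses it for regression. It is a simplified stand-in under stated assumptions: the symmetric kernel used here, the tanh-based surrogate, the network shape, the toy data, and all names are illustrative and are not necessarily the paper's surrogate gradient NTK or experimental setup.

```python
import numpy as np

def surrogate_derivative(z, beta=1.0):
    """Derivative of tanh(beta * z), standing in for the derivative of sign (assumption)."""
    return beta * (1.0 - np.tanh(beta * z) ** 2)

def quasi_gradients(X, W, a):
    """Quasi-gradients of f(x) = a . sign(W x) / sqrt(n) with respect to (a, W), one row per input."""
    n = W.shape[0]
    pre = X @ W.T                                # pre-activations, shape (num_points, n)
    grad_a = np.sign(pre) / np.sqrt(n)           # exact derivative w.r.t. output weights
    # Derivative w.r.t. hidden weights, with sign' replaced by the surrogate derivative.
    grad_W = (a * surrogate_derivative(pre))[:, :, None] * X[:, None, :] / np.sqrt(n)
    return np.concatenate([grad_a, grad_W.reshape(len(X), -1)], axis=1)

def kernel_regression(K_train, K_test_train, y_train, ridge=1e-8):
    """Predict with K(x, X) (K(X, X) + ridge * I)^{-1} y; the small ridge is for numerical stability."""
    alpha = np.linalg.solve(K_train + ridge * np.eye(len(y_train)), y_train)
    return K_test_train @ alpha

rng = np.random.default_rng(0)
n, d = 2000, 2                                   # hidden width and input dimension
W = rng.standard_normal((n, d))
a = rng.standard_normal(n)

# Hypothetical toy data on the unit circle. The network has no bias, so f(-x) = -f(x);
# antipodal training points would give linearly dependent kernel rows and are avoided.
train_angles = np.deg2rad([0.0, 45.0, 90.0, 135.0])
X_train = np.stack([np.cos(train_angles), np.sin(train_angles)], axis=1)
y_train = np.array([1.0, 1.0, -1.0, -1.0])
test_angles = np.deg2rad([20.0, 110.0])
X_test = np.stack([np.cos(test_angles), np.sin(test_angles)], axis=1)

G_train = quasi_gradients(X_train, W, a)
G_test = quasi_gradients(X_test, W, a)
K_train = G_train @ G_train.T                    # kernel on training inputs
K_test_train = G_test @ G_train.T                # kernel between test and training inputs

predictions = kernel_regression(K_train, K_test_train, y_train)
print(predictions)
```

A comparison in the spirit of the paper would additionally train the finite-width network with SGL and check how closely its outputs on the test inputs track the kernel predictions as the width grows.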

Critical Analysis

The paper provides a strong theoretical foundation for understanding the behavior of surrogate gradient learning, which is an important technique for training neural networks with non-differentiable activation functions. The generalization of the NTK to handle surrogate derivatives is a significant contribution, as it enables a more rigorous analysis of SGL.

However, the paper does not address some potential limitations of SGL. For example, the choice of surrogate derivative can have a significant impact on the performance of the training process, and the paper does not explore how to select the best surrogate function. Additionally, the numerical experiments are limited to a specific type of network (binary networks with sign activation functions), and it's unclear how well the surrogate gradient NTK would apply to other types of non-differentiable networks, such as spiking neural networks.

Further research could explore these areas, as well as investigate the practical implications of the surrogate gradient NTK for the design and optimization of neural networks with non-differentiable components.

Conclusion

This paper presents a significant advancement in the theoretical understanding of surrogate gradient learning, a crucial technique for training neural networks with non-differentiable activation functions. By generalizing the neural tangent kernel to handle surrogate derivatives, the authors have provided a more solid mathematical foundation for analyzing the behavior of SGL.

The numerical experiments demonstrate the utility of the surrogate gradient NTK in characterizing the performance of SGL, especially in binary neural networks. This work paves the way for further research into the design and optimization of neural networks with non-differentiable components, which could have important implications for the development of more efficient and robust machine learning models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The Positivity of the Neural Tangent Kernel

Luís Carvalho, João L. Costa, José Mourão, Gonçalo Oliveira

The Neural Tangent Kernel (NTK) has emerged as a fundamental concept in the study of wide Neural Networks. In particular, it is known that the positivity of the NTK is directly related to the memorization capacity of sufficiently wide networks, i.e., to the possibility of reaching zero loss in training, via gradient descent. Here we will improve on previous works and obtain a sharp result concerning the positivity of the NTK of feedforward networks of any depth. More precisely, we will show that, for any non-polynomial activation function, the NTK is strictly positive definite. Our results are based on a novel characterization of polynomial functions which is of independent interest.

4/22/2024

cs.LG cs.AI

Approximation and Gradient Descent Training with Neural Networks

G. Welper

It is well understood that neural networks with carefully hand-picked weights provide powerful function approximation and that they can be successfully trained in over-parametrized regimes. Since over-parametrization ensures zero training error, these two theories are not immediately compatible. Recent work uses the smoothness that is required for approximation results to extend a neural tangent kernel (NTK) optimization argument to an under-parametrized regime and show direct approximation bounds for networks trained by gradient flow. Since gradient flow is only an idealization of a practical method, this paper establishes analogous results for networks trained by gradient descent.

5/21/2024

cs.LG

Elucidating the theoretical underpinnings of surrogate gradient learning in spiking neural networks

Julia Gygax, Friedemann Zenke

Training spiking neural networks to approximate complex functions is essential for studying information processing in the brain and neuromorphic computing. Yet, the binary nature of spikes constitutes a challenge for direct gradient-based training. To sidestep this problem, surrogate gradients have proven empirically successful, but their theoretical foundation remains elusive. Here, we investigate the relation of surrogate gradients to two theoretically well-founded approaches. On the one hand, we consider smoothed probabilistic models, which, due to lack of support for automatic differentiation, are impractical for training deep spiking neural networks, yet provide gradients equivalent to surrogate gradients in single neurons. On the other hand, we examine stochastic automatic differentiation, which is compatible with discrete randomness but has never been applied to spiking neural network training. We find that the latter provides the missing theoretical basis for surrogate gradients in stochastic spiking neural networks. We further show that surrogate gradients in deterministic networks correspond to a particular asymptotic case and numerically confirm the effectiveness of surrogate gradients in stochastic multi-layer spiking neural networks. Finally, we illustrate that surrogate gradients are not conservative fields and, thus, not gradients of a surrogate loss. Our work provides the missing theoretical foundation for surrogate gradients and an analytically well-founded solution for end-to-end training of stochastic spiking neural networks.

6/7/2024

cs.NE

Equivariant Neural Tangent Kernels

Philipp Misof, Pan Kessel, Jan E. Gerken

Equivariant neural networks have in recent years become an important technique for guiding architecture selection for neural networks with many applications in domains ranging from medical image analysis to quantum chemistry. In particular, as the most general linear equivariant layers with respect to the regular representation, group convolutions have been highly impactful in numerous applications. Although equivariant architectures have been studied extensively, much less is known about the training dynamics of equivariant neural networks. Concurrently, neural tangent kernels (NTKs) have emerged as a powerful tool to analytically understand the training dynamics of wide neural networks. In this work, we combine these two fields for the first time by giving explicit expressions for NTKs of group convolutional neural networks. In numerical experiments, we demonstrate superior performance for equivariant NTKs over non-equivariant NTKs on a classification task for medical images.

6/11/2024

cs.LG