Elucidating the theoretical underpinnings of surrogate gradient learning in spiking neural networks

arXiv: 2404.14964

Published 6/7/2024 by Julia Gygax, Friedemann Zenke


Abstract

Training spiking neural networks to approximate complex functions is essential for studying information processing in the brain and neuromorphic computing. Yet, the binary nature of spikes constitutes a challenge for direct gradient-based training. To sidestep this problem, surrogate gradients have proven empirically successful, but their theoretical foundation remains elusive. Here, we investigate the relation of surrogate gradients to two theoretically well-founded approaches. On the one hand, we consider smoothed probabilistic models, which, due to lack of support for automatic differentiation, are impractical for training deep spiking neural networks, yet provide gradients equivalent to surrogate gradients in single neurons. On the other hand, we examine stochastic automatic differentiation, which is compatible with discrete randomness but has never been applied to spiking neural network training. We find that the latter provides the missing theoretical basis for surrogate gradients in stochastic spiking neural networks. We further show that surrogate gradients in deterministic networks correspond to a particular asymptotic case and numerically confirm the effectiveness of surrogate gradients in stochastic multi-layer spiking neural networks. Finally, we illustrate that surrogate gradients are not conservative fields and, thus, not gradients of a surrogate loss. Our work provides the missing theoretical foundation for surrogate gradients and an analytically well-founded solution for end-to-end training of stochastic spiking neural networks.


Overview

Training spiking neural networks to approximate complex functions is important for studying the brain and developing neuromorphic computing.
However, the binary nature of spikes poses a challenge for direct gradient-based training.
Surrogate gradients have been empirically successful, but their theoretical foundation has been unclear.
This paper investigates the relation of surrogate gradients to two other theoretical approaches: smoothed probabilistic models and stochastic automatic differentiation.

Plain English Explanation

Spiking neural networks, which mimic the way neurons in the brain communicate using electrical pulses or "spikes," are important for understanding how the brain processes information and for developing neuromorphic computing systems that work more like the brain. However, the binary nature of these spikes makes it challenging to directly train these networks using gradient-based methods, which are commonly used to train other types of neural networks.

To get around this problem, researchers have used a technique called "surrogate gradients," which provides a way to approximate the gradients during training. While surrogate gradients have been shown to work well in practice, their theoretical justification has been unclear. This paper explores the connection between surrogate gradients and two other theoretical approaches: smoothed probabilistic models and stochastic automatic differentiation.

The key insight is that stochastic automatic differentiation, a technique for differentiating through programs that contain discrete random choices, can provide the theoretical foundation for surrogate gradients in stochastic spiking neural networks. The paper also shows that surrogate gradients in deterministic networks correspond to a particular asymptotic case of this more general framework.

Overall, this work helps to solidify the theoretical underpinnings of the widely used surrogate gradient approach, providing a more robust theoretical basis for training spiking neural networks.

Technical Explanation

The paper investigates the theoretical foundations of using "surrogate gradients" to train spiking neural networks, which are an important model for understanding information processing in the brain and developing neuromorphic computing systems.

The binary nature of spikes in spiking neural networks makes it challenging to directly apply gradient-based training methods, which are commonly used for other types of neural networks. Surrogate gradients have been shown to work well empirically, but their theoretical justification has been unclear.
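The problem and the workaround can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the threshold `theta`, the steepness `beta`, and the choice of a sigmoid-derivative surrogate are all assumptions made here for illustration.

```python
import math

def heaviside(u, theta=1.0):
    """Spike nonlinearity: 1.0 if the membrane potential u reaches theta, else 0.0."""
    return 1.0 if u >= theta else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def surrogate_derivative(u, theta=1.0, beta=5.0):
    """Derivative of sigmoid(beta * (u - theta)), used in place of the step's derivative."""
    s = sigmoid(beta * (u - theta))
    return beta * s * (1.0 - s)

def finite_difference(f, u, eps=1e-4):
    """Central finite-difference estimate of df/du."""
    return (f(u + eps) - f(u - eps)) / (2.0 * eps)

# Away from the threshold the step function has zero slope, so no learning
# signal passes through it; the surrogate slope is nonzero everywhere.
for u in (0.5, 0.9, 1.5):
    step_slope = finite_difference(heaviside, u)
    print(f"u={u}: step slope={step_slope}, surrogate slope={surrogate_derivative(u):.4f}")
```

In a surrogate gradient setup, the forward pass still emits binary spikes via the step function; only the backward pass substitutes the smooth slope.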

The authors explore two theoretically well-founded approaches and their relation to surrogate gradients:

Smoothed probabilistic models: These models provide gradients equivalent to surrogate gradients in single neurons, but are impractical for training deep spiking neural networks due to the lack of support for automatic differentiation.
Stochastic automatic differentiation: This approach is compatible with discrete randomness, such as the random spike generation in stochastic spiking neural networks, but has not previously been applied to spiking neural network training.
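The single-neuron equivalence mentioned for smoothed probabilistic models can be made concrete with a short derivation. The setup below is an assumption for illustration and is not spelled out in this summary: a neuron with membrane potential $u = w^\top x$, threshold $\theta$, and a differentiable escape-noise function $\sigma_\beta$ that maps the potential to a spike probability.

```latex
\begin{align}
  S &\sim \mathrm{Bernoulli}(p), \qquad p = \sigma_\beta(u - \theta), \qquad u = w^\top x, \\
  \mathbb{E}[S] &= p = \sigma_\beta(u - \theta), \\
  \frac{\partial\, \mathbb{E}[S]}{\partial w} &= \sigma_\beta'(u - \theta)\, x .
\end{align}
```

A surrogate gradient for the deterministic neuron $S = H(u - \theta)$ replaces the ill-defined derivative of the step function $H$ with a surrogate derivative. If that surrogate is chosen as $\sigma_\beta'$, the resulting expression $\sigma_\beta'(u - \theta)\, x$ has the same form as the gradient of the expected output above.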

The key finding is that stochastic automatic differentiation provides the missing theoretical basis for surrogate gradients in stochastic spiking neural networks. The authors also show that surrogate gradients in deterministic networks correspond to a particular asymptotic case of this more general framework.
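One way to picture how a deterministic neuron can arise as a limiting case of a stochastic one is to let the noise in the spike-generation mechanism vanish. The sketch below assumes a sigmoidal escape-noise function with steepness `beta`, which is an illustrative choice. This summary does not specify which quantities the paper holds fixed when taking its limit, so this shows only the forward-pass intuition.

```python
import math

def spike_probability(u, theta=1.0, beta=1.0):
    """Assumed sigmoidal escape noise: P(spike) = sigmoid(beta * (u - theta))."""
    return 1.0 / (1.0 + math.exp(-beta * (u - theta)))

def deterministic_spike(u, theta=1.0):
    """Deterministic threshold neuron: spikes exactly when u reaches theta."""
    return 1.0 if u >= theta else 0.0

def max_gap(beta, potentials):
    """Largest difference between spike probability and deterministic output."""
    return max(
        abs(spike_probability(u, beta=beta) - deterministic_spike(u))
        for u in potentials
    )

# As beta grows, the spike probability approaches the hard threshold
# at every membrane potential away from theta.
potentials = [0.0, 0.5, 0.9, 1.1, 1.5, 2.0]
for beta in (1.0, 10.0, 100.0):
    print(f"beta={beta}: max gap = {max_gap(beta, potentials):.4f}")
```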

Furthermore, the paper demonstrates the effectiveness of surrogate gradients in stochastic multi-layer spiking neural networks through numerical experiments. Finally, the authors show that surrogate gradients are not conservative fields and, therefore, are not gradients of a surrogate loss function.
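The non-conservative property can be illustrated with a toy example that is not taken from the paper. The gradient of a smooth scalar loss has equal mixed partial derivatives, so its Jacobian is symmetric. The sketch below uses a hypothetical two-weight model, `w2 * step(w1 * X)`, with a sigmoid-derivative surrogate, and checks numerically that the surrogate gradient field violates this symmetry.

```python
import math

def step(u):
    return 1.0 if u >= 0.0 else 0.0

def surrogate_slope(u):
    """Sigmoid derivative, standing in for the derivative of step()."""
    s = 1.0 / (1.0 + math.exp(-u))
    return s * (1.0 - s)

X, TARGET = 1.0, 0.3  # hypothetical input and target

def surrogate_gradient(w1, w2):
    """Surrogate 'gradient' of the loss 0.5 * (w2 * step(w1 * X) - TARGET)**2."""
    u = w1 * X
    error = w2 * step(u) - TARGET
    g1 = error * w2 * surrogate_slope(u) * X  # step' replaced by the surrogate slope
    g2 = error * step(u)                      # exact partial derivative w.r.t. w2
    return g1, g2

def cross_partials(w1, w2, eps=1e-5):
    """Central-difference estimates of d(g1)/d(w2) and d(g2)/d(w1)."""
    dg1_dw2 = (surrogate_gradient(w1, w2 + eps)[0]
               - surrogate_gradient(w1, w2 - eps)[0]) / (2 * eps)
    dg2_dw1 = (surrogate_gradient(w1 + eps, w2)[1]
               - surrogate_gradient(w1 - eps, w2)[1]) / (2 * eps)
    return dg1_dw2, dg2_dw1

dg1_dw2, dg2_dw1 = cross_partials(0.5, 1.0)
print(f"d(g1)/d(w2) = {dg1_dw2:.4f}, d(g2)/d(w1) = {dg2_dw1:.4f}")
```

Because the two cross partials differ at this point, no smooth scalar function can have this vector field as its gradient in a neighborhood of it, which is what it means for the field not to be conservative.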

Critical Analysis

The paper provides a solid theoretical foundation for the widely used surrogate gradient approach to training spiking neural networks, addressing an important gap in the existing literature. By connecting surrogate gradients to well-established techniques like smoothed probabilistic models and stochastic automatic differentiation, the authors have strengthened the theoretical underpinnings of this practical method.

One potential limitation of the work is that it focuses primarily on the theoretical analysis and does not extensively explore the practical implications or real-world applications of the surrogate gradient framework. While the numerical experiments demonstrate its effectiveness, further research may be needed to fully understand the performance and limitations of surrogate gradients in diverse spiking neural network architectures and tasks.

Additionally, the paper does not address potential issues related to the scalability or computational efficiency of the stochastic automatic differentiation approach, which could be important considerations for training large-scale spiking neural networks. Exploring these practical aspects may be a fruitful area for future research.

Overall, this paper makes a valuable contribution to the field by providing a more rigorous theoretical basis for surrogate gradients in spiking neural networks, paving the way for further advancements in this important area of research.

Conclusion

This paper investigates the theoretical foundations of using surrogate gradients to train spiking neural networks, which are crucial for understanding information processing in the brain and developing neuromorphic computing systems. By connecting surrogate gradients to smoothed probabilistic models and stochastic automatic differentiation, the authors have provided a solid theoretical basis for this widely used practical technique.

The key insights are that stochastic automatic differentiation can provide the missing theoretical justification for surrogate gradients in stochastic spiking neural networks, and that surrogate gradients in deterministic networks correspond to a particular asymptotic case of this more general framework. The paper also shows that surrogate gradients are not gradients of any surrogate loss. This work strengthens the theoretical underpinnings of spiking neural network training and offers an analytically grounded route to end-to-end training of stochastic spiking neural networks.

