Memory capacity of three-layer neural networks with non-polynomial activations

2405.13738

Published 5/24/2024 by Liam Madden

🧠

Abstract

The minimal number of neurons required for a feedforward neural network to interpolate $n$ generic input-output pairs from $mathbb{R}^dtimes mathbb{R}$ is $Theta(sqrt{n})$. While previous results have shown that $Theta(sqrt{n})$ neurons are sufficient, they have been limited to logistic, Heaviside, and rectified linear unit (ReLU) as the activation function. Using a different approach, we prove that $Theta(sqrt{n})$ neurons are sufficient as long as the activation function is real analytic at a point and not a polynomial there. Thus, the only practical activation functions that our result does not apply to are piecewise polynomials. Importantly, this means that activation functions can be freely chosen in a problem-dependent manner without loss of interpolation power.

Create account to get full access

Overview

This paper investigates the minimum number of neurons required for a feedforward neural network to accurately interpolate a set of generic input-output pairs.
Previous results have shown that Θ(√n) neurons are sufficient for this task, but were limited to certain activation functions like logistic, Heaviside, and ReLU.
The authors prove that Θ(√n) neurons are sufficient for any real-analytic activation function that is not a polynomial, greatly expanding the range of applicable activation functions.

Plain English Explanation

Neural networks are machine learning models that can learn to map inputs to outputs by adjusting the strengths of connections between artificial neurons. When training a neural network to interpolate a set of input-output pairs, the number of neurons needed is an important consideration.

Previous research has shown that square root of the number of input-output pairs is the minimum number of neurons required, but this was limited to certain types of activation functions - the mathematical equations that determine how each neuron processes its inputs.

This new paper demonstrates that the square root of the number of input-output pairs is sufficient for any activation function that is "real-analytic" - meaning it can be approximated by a power series. This includes a wide range of common activation functions like sigmoid, tanh, and softplus, but excludes piecewise polynomials like ReLU.

The significance is that neural network designers now have more freedom to choose activation functions that are well-suited to their specific problem, without worrying about the number of neurons required for accurate interpolation.

Technical Explanation

The paper analyzes the minimum number of neurons needed for a feedforward neural network to interpolate n generic input-output pairs from ℝd × ℝ. Previous works had shown that Θ(√n) neurons are sufficient for this task, but were limited to logistic, Heaviside, and rectified linear unit (ReLU) activation functions.

Using a different mathematical approach, the authors prove that Θ(√n) neurons are sufficient as long as the activation function is real-analytic at a point and not a polynomial there. This covers a much broader class of activation functions beyond the previously studied ones.

Importantly, this means neural network designers have more flexibility in choosing activation functions that are well-suited to their specific problem, without sacrificing the ability to accurately interpolate a set of input-output pairs with a relatively small number of neurons.

Critical Analysis

The paper makes an important theoretical contribution by expanding the range of activation functions that can be used in neural networks while still maintaining the Θ(√n) neuron count for interpolation. This provides more design freedom for practitioners.

However, the analysis is limited to the specific task of interpolating a set of generic input-output pairs. It does not address other important neural network capabilities like generalization, which is critical for real-world applications.

Additionally, while the authors prove the Θ(√n) bound holds for a broad class of activation functions, they do not provide any analysis on the practical implications of using different activation functions. The performance and training characteristics may vary significantly depending on the choice of activation.

Further research could investigate the empirical performance of neural networks with different real-analytic activation functions, beyond just theoretical bounds. Exploring the trade-offs and practical considerations would help practitioners make more informed choices when designing neural network architectures.

Conclusion

This paper demonstrates that feedforward neural networks only require a number of neurons proportional to the square root of the number of input-output pairs they need to interpolate, as long as the activation function is real-analytic and not a polynomial. This greatly expands the range of activation functions that can be used while maintaining this favorable neuron scaling.

The significance is that neural network designers now have more flexibility in choosing activation functions that are well-suited to their specific problem domain, without sacrificing the ability to accurately interpolate a set of input-output pairs with a relatively small number of neurons. This could lead to the development of more efficient and effective neural network architectures across a variety of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🧠

On the power of graph neural networks and the role of the activation function

Sammy Khalife, Amitabh Basu

In this article we present new results about the expressivity of Graph Neural Networks (GNNs). We prove that for any GNN with piecewise polynomial activations, whose architecture size does not grow with the graph input sizes, there exists a pair of non-isomorphic rooted trees of depth two such that the GNN cannot distinguish their root vertex up to an arbitrary number of iterations. The proof relies on tools from the algebra of symmetric polynomials. In contrast, it was already known that unbounded GNNs (those whose size is allowed to change with the graph sizes) with piecewise polynomial activations can distinguish these vertices in only two iterations. It was also known prior to our work that with ReLU (piecewise linear) activations, bounded GNNs are weaker than unbounded GNNs [Aamand & Al., 2022]. Our approach adds to this result by extending it to handle any piecewise polynomial activation function, which goes towards answering an open question formulated by Grohe [Grohe,2021] more completely. Our second result states that if one allows activations that are not piecewise polynomial, then in two iterations a single neuron perceptron can distinguish the root vertices of any pair of nonisomorphic trees of depth two (our results hold for activations like the sigmoid, hyperbolic tan and others). This shows how the power of graph neural networks can change drastically if one changes the activation function of the neural networks. The proof of this result utilizes the Lindemann-Weierstrauss theorem from transcendental number theory.

5/8/2024

cs.LG

🧠

1-Lipschitz Neural Networks are more expressive with N-Activations

Bernd Prach, Christoph H. Lampert

A crucial property for achieving secure, trustworthy and interpretable deep learning systems is their robustness: small changes to a system's inputs should not result in large changes to its outputs. Mathematically, this means one strives for networks with a small Lipschitz constant. Several recent works have focused on how to construct such Lipschitz networks, typically by imposing constraints on the weight matrices. In this work, we study an orthogonal aspect, namely the role of the activation function. We show that commonly used activation functions, such as MaxMin, as well as all piece-wise linear ones with two segments unnecessarily restrict the class of representable functions, even in the simplest one-dimensional setting. We furthermore introduce the new N-activation function that is provably more expressive than currently popular activation functions. We provide code at https://github.com/berndprach/NActivation.

6/4/2024

cs.LG

🧠

Memory capacity of two layer neural networks with smooth activations

Liam Madden, Christos Thrampoulidis

Determining the memory capacity of two layer neural networks with $m$ hidden neurons and input dimension $d$ (i.e., $md+2m$ total trainable parameters), which refers to the largest size of general data the network can memorize, is a fundamental machine learning question. For activations that are real analytic at a point and, if restricting to a polynomial there, have sufficiently high degree, we establish a lower bound of $lfloor md/2rfloor$ and optimality up to a factor of approximately $2$. All practical activations, such as sigmoids, Heaviside, and the rectified linear unit (ReLU), are real analytic at a point. Furthermore, the degree condition is mild, requiring, for example, that $binom{k+d-1}{d-1}ge n$ if the activation is $x^k$. Analogous prior results were limited to Heaviside and ReLU activations -- our result covers almost everything else. In order to analyze general activations, we derive the precise generic rank of the network's Jacobian, which can be written in terms of Hadamard powers and the Khatri-Rao product. Our analysis extends classical linear algebraic facts about the rank of Hadamard powers. Overall, our approach differs from prior works on memory capacity and holds promise for extending to deeper models and other architectures.

5/3/2024

cs.LG

🤿

Approximation and interpolation of deep neural networks

Vlad-Raul Constantinescu, Ionel Popescu

In this paper, we prove that in the overparametrized regime, deep neural network provide universal approximations and can interpolate any data set, as long as the activation function is locally in $L^1(RR)$ and not an affine function. Additionally, if the activation function is smooth and such an interpolation networks exists, then the set of parameters which interpolate forms a manifold. Furthermore, we give a characterization of the Hessian of the loss function evaluated at the interpolation points. In the last section, we provide a practical probabilistic method of finding such a point under general conditions on the activation function.

4/26/2024

cs.LG stat.ML