Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations

Read original: arXiv:2406.11828 - Published 6/18/2024 by Kazusato Oko, Yujin Song, Taiji Suzuki, Denny Wu
Total Score

0

🏋️

Sign in to get full access

or

If you already have an account, we'll log you in

Overview

  • This paper studies the computational and sample complexity of learning a target function with additive structure, where the function is a sum of non-linear single-index models (ridge functions) with diverse and near-orthogonal features.
  • The number of additive tasks grows with the dimensionality, motivated by classical additive models, recent representation learning theory, and large-scale pretraining.
  • The authors prove that a large subset of polynomial target functions can be efficiently learned by gradient descent training of a two-layer neural network, with polynomial statistical and computational complexity.
  • They also establish statistical query (SQ) lower bounds for both correlational SQ and full SQ algorithms, providing computational hardness results.

Plain English Explanation

The paper examines a type of mathematical function that can be broken down into a sum of simpler non-linear functions, each of which relies on a specific set of input features. The authors link this to the classical additive model literature, recent neural network representation learning theory, and large-scale pretraining of models that acquire many localized skills.

The key finding is that a large class of these additive functions, specifically polynomial functions, can be efficiently learned using a two-layer neural network and gradient descent training. This is true even when the number of these simpler function components grows with the dimensionality of the input, and the specific form of the component functions is unknown.

The authors also show that there are computational limits on learning these types of functions, by establishing lower bounds on the performance of certain algorithms. This provides a more complete picture of what is and is not possible when it comes to efficiently learning these additive, high-dimensional functions.

Technical Explanation

The target function $f_

$ has the additive form $f_
(x) = \frac{1}{\sqrt{M}} \sum_{m=1}^M f_m(\langle x, v_m\rangle)$, where $f_1, f_2, ..., f_M$ are non-linear link functions of single-index models (ridge functions) with diverse and near-orthogonal index features ${v_m}_{m=1}^M$. Crucially, the number of additive tasks $M$ grows with the dimensionality $d$ as $M \asymp d^\gamma$ for $\gamma \geq 0$.

The authors prove that a large subset of polynomial $f_*$ can be efficiently learned by gradient descent training of a two-layer neural network. The statistical and computational complexity of their approach depends on $M$ and the information exponent of the $f_m$, despite the unknown link functions and growing $M$.

To complement this positive learnability result, the authors establish statistical query (SQ) lower bounds for both the correlational SQ and full SQ algorithms. This demonstrates computational hardness for learning these types of additive functions in the SQ model, which captures the information available to many practical learning algorithms.

Critical Analysis

The paper makes an important contribution by rigorously analyzing the computational and sample complexity of learning additive, high-dimensional functions. The positive learnability result for polynomial $f_*$ is particularly noteworthy, as it suggests neural networks can efficiently learn a broad class of these structured functions.

However, the authors acknowledge that their results only apply to a specific subset of additive functions, and do not fully characterize the learnability of all possible $f_*$. Additionally, the SQ lower bounds indicate there are fundamental limits on the efficiency of learning these types of functions, which should motivate further research into alternative approaches.

It would also be valuable to see empirical validation of the theoretical findings, to better understand how they translate to practical machine learning scenarios. Exploring the performance of neural networks on additive function learning tasks, and comparing to other methods, could provide important insights.

Conclusion

This paper makes a significant contribution to our theoretical understanding of learning high-dimensional additive functions, a problem relevant to classical statistical models, neural network representation learning, and large-scale pretraining. The positive learnability result for polynomial $f_*$ and the corresponding computational hardness findings provide a more complete picture of what is possible when it comes to efficiently learning these types of structured functions. The insights from this work could have important implications for the design and analysis of machine learning algorithms that operate in high-dimensional, structured domains.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🏋️

Total Score

0

Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations

Kazusato Oko, Yujin Song, Taiji Suzuki, Denny Wu

We study the computational and sample complexity of learning a target function $f_*:mathbb{R}^dtomathbb{R}$ with additive structure, that is, $f_*(x) = frac{1}{sqrt{M}}sum_{m=1}^M f_m(langle x, v_mrangle)$, where $f_1,f_2,...,f_M:mathbb{R}tomathbb{R}$ are nonlinear link functions of single-index models (ridge functions) with diverse and near-orthogonal index features ${v_m}_{m=1}^M$, and the number of additive tasks $M$ grows with the dimensionality $Masymp d^gamma$ for $gammage 0$. This problem setting is motivated by the classical additive model literature, the recent representation learning theory of two-layer neural network, and large-scale pretraining where the model simultaneously acquires a large number of skills that are often localized in distinct parts of the trained network. We prove that a large subset of polynomial $f_*$ can be efficiently learned by gradient descent training of a two-layer neural network, with a polynomial statistical and computational complexity that depends on the number of tasks $M$ and the information exponent of $f_m$, despite the unknown link function and $M$ growing with the dimensionality. We complement this learnability guarantee with computational hardness result by establishing statistical query (SQ) lower bounds for both the correlational SQ and full SQ algorithms.

Read more

6/18/2024

Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit
Total Score

0

Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit

Jason D. Lee, Kazusato Oko, Taiji Suzuki, Denny Wu

We study the problem of gradient descent learning of a single-index target function $f_*(boldsymbol{x}) = textstylesigma_*left(langleboldsymbol{x},boldsymbol{theta}rangleright)$ under isotropic Gaussian data in $mathbb{R}^d$, where the link function $sigma_*:mathbb{R}tomathbb{R}$ is an unknown degree $q$ polynomial with information exponent $p$ (defined as the lowest degree in the Hermite expansion). Prior works showed that gradient-based training of neural networks can learn this target with $ngtrsim d^{Theta(p)}$ samples, and such statistical complexity is predicted to be necessary by the correlational statistical query lower bound. Surprisingly, we prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary polynomial link function with a sample and runtime complexity of $n asymp T asymp C(q) cdot dmathrm{polylog} d$, where constant $C(q)$ only depends on the degree of $sigma_*$, regardless of information exponent; this dimension dependence matches the information theoretic limit up to polylogarithmic factors. Core to our analysis is the reuse of minibatch in the gradient computation, which gives rise to higher-order information beyond correlational queries.

Read more

6/4/2024

Learning smooth functions in high dimensions: from sparse polynomials to deep neural networks
Total Score

0

Learning smooth functions in high dimensions: from sparse polynomials to deep neural networks

Ben Adcock, Simone Brugiapaglia, Nick Dexter, Sebastian Moraga

Learning approximations to smooth target functions of many variables from finite sets of pointwise samples is an important task in scientific computing and its many applications in computational science and engineering. Despite well over half a century of research on high-dimensional approximation, this remains a challenging problem. Yet, significant advances have been made in the last decade towards efficient methods for doing this, commencing with so-called sparse polynomial approximation methods and continuing most recently with methods based on Deep Neural Networks (DNNs). In tandem, there have been substantial advances in the relevant approximation theory and analysis of these techniques. In this work, we survey this recent progress. We describe the contemporary motivations for this problem, which stem from parametric models and computational uncertainty quantification; the relevant function classes, namely, classes of infinite-dimensional, Banach-valued, holomorphic functions; fundamental limits of learnability from finite data for these classes; and finally, sparse polynomial and DNN methods for efficiently learning such functions from finite data. For the latter, there is currently a significant gap between the approximation theory of DNNs and the practical performance of deep learning. Aiming to narrow this gap, we develop the topic of practical existence theory, which asserts the existence of dimension-independent DNN architectures and training strategies that achieve provably near-optimal generalization errors in terms of the amount of training data.

Read more

4/8/2024

On the Complexity of Learning Sparse Functions with Statistical and Gradient Queries
Total Score

0

On the Complexity of Learning Sparse Functions with Statistical and Gradient Queries

Nirmit Joshi, Theodor Misiakiewicz, Nathan Srebro

The goal of this paper is to investigate the complexity of gradient algorithms when learning sparse functions (juntas). We introduce a type of Statistical Queries ($mathsf{SQ}$), which we call Differentiable Learning Queries ($mathsf{DLQ}$), to model gradient queries on a specified loss with respect to an arbitrary model. We provide a tight characterization of the query complexity of $mathsf{DLQ}$ for learning the support of a sparse function over generic product distributions. This complexity crucially depends on the loss function. For the squared loss, $mathsf{DLQ}$ matches the complexity of Correlation Statistical Queries $(mathsf{CSQ})$--potentially much worse than $mathsf{SQ}$. But for other simple loss functions, including the $ell_1$ loss, $mathsf{DLQ}$ always achieves the same complexity as $mathsf{SQ}$. We also provide evidence that $mathsf{DLQ}$ can indeed capture learning with (stochastic) gradient descent by showing it correctly describes the complexity of learning with a two-layer neural network in the mean field regime and linear scaling.

Read more

7/9/2024