Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations

Kazusato Oko, Yujin Song, Taiji Suzuki, Denny Wu

We study the computational and sample complexity of learning a target function $f_*:mathbb{R}^dtomathbb{R}$ with additive structure, that is, $f_*(x) = frac{1}{sqrt{M}}sum_{m=1}^M f_m(langle x, v_mrangle)$, where $f_1,f_2,...,f_M:mathbb{R}tomathbb{R}$ are nonlinear link functions of single-index models (ridge functions) with diverse and near-orthogonal index features ${v_m}_{m=1}^M$, and the number of additive tasks $M$ grows with the dimensionality $Masymp d^gamma$ for $gammage 0$. This problem setting is motivated by the classical additive model literature, the recent representation learning theory of two-layer neural network, and large-scale pretraining where the model simultaneously acquires a large number of skills that are often localized in distinct parts of the trained network. We prove that a large subset of polynomial $f_*$ can be efficiently learned by gradient descent training of a two-layer neural network, with a polynomial statistical and computational complexity that depends on the number of tasks $M$ and the information exponent of $f_m$, despite the unknown link function and $M$ growing with the dimensionality. We complement this learnability guarantee with computational hardness result by establishing statistical query (SQ) lower bounds for both the correlational SQ and full SQ algorithms.

6/18/2024

Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit

Jason D. Lee, Kazusato Oko, Taiji Suzuki, Denny Wu

We study the problem of gradient descent learning of a single-index target function $f_*(boldsymbol{x}) = textstylesigma_*left(langleboldsymbol{x},boldsymbol{theta}rangleright)$ under isotropic Gaussian data in $mathbb{R}^d$, where the link function $sigma_*:mathbb{R}tomathbb{R}$ is an unknown degree $q$ polynomial with information exponent $p$ (defined as the lowest degree in the Hermite expansion). Prior works showed that gradient-based training of neural networks can learn this target with $ngtrsim d^{Theta(p)}$ samples, and such statistical complexity is predicted to be necessary by the correlational statistical query lower bound. Surprisingly, we prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary polynomial link function with a sample and runtime complexity of $n asymp T asymp C(q) cdot dmathrm{polylog} d$, where constant $C(q)$ only depends on the degree of $sigma_*$, regardless of information exponent; this dimension dependence matches the information theoretic limit up to polylogarithmic factors. Core to our analysis is the reuse of minibatch in the gradient computation, which gives rise to higher-order information beyond correlational queries.

6/4/2024

Learning smooth functions in high dimensions: from sparse polynomials to deep neural networks

Ben Adcock, Simone Brugiapaglia, Nick Dexter, Sebastian Moraga

Learning approximations to smooth target functions of many variables from finite sets of pointwise samples is an important task in scientific computing and its many applications in computational science and engineering. Despite well over half a century of research on high-dimensional approximation, this remains a challenging problem. Yet, significant advances have been made in the last decade towards efficient methods for doing this, commencing with so-called sparse polynomial approximation methods and continuing most recently with methods based on Deep Neural Networks (DNNs). In tandem, there have been substantial advances in the relevant approximation theory and analysis of these techniques. In this work, we survey this recent progress. We describe the contemporary motivations for this problem, which stem from parametric models and computational uncertainty quantification; the relevant function classes, namely, classes of infinite-dimensional, Banach-valued, holomorphic functions; fundamental limits of learnability from finite data for these classes; and finally, sparse polynomial and DNN methods for efficiently learning such functions from finite data. For the latter, there is currently a significant gap between the approximation theory of DNNs and the practical performance of deep learning. Aiming to narrow this gap, we develop the topic of practical existence theory, which asserts the existence of dimension-independent DNN architectures and training strategies that achieve provably near-optimal generalization errors in terms of the amount of training data.

4/8/2024

On the Complexity of Learning Sparse Functions with Statistical and Gradient Queries

Nirmit Joshi, Theodor Misiakiewicz, Nathan Srebro

The goal of this paper is to investigate the complexity of gradient algorithms when learning sparse functions (juntas). We introduce a type of Statistical Queries ($mathsf{SQ}$), which we call Differentiable Learning Queries ($mathsf{DLQ}$), to model gradient queries on a specified loss with respect to an arbitrary model. We provide a tight characterization of the query complexity of $mathsf{DLQ}$ for learning the support of a sparse function over generic product distributions. This complexity crucially depends on the loss function. For the squared loss, $mathsf{DLQ}$ matches the complexity of Correlation Statistical Queries $(mathsf{CSQ})$--potentially much worse than $mathsf{SQ}$. But for other simple loss functions, including the $ell_1$ loss, $mathsf{DLQ}$ always achieves the same complexity as $mathsf{SQ}$. We also provide evidence that $mathsf{DLQ}$ can indeed capture learning with (stochastic) gradient descent by showing it correctly describes the complexity of learning with a two-layer neural network in the mean field regime and linear scaling.

7/9/2024

Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations

Overview

Plain English Explanation

Technical Explanation

Critical Analysis

Conclusion

Related Papers