A multiobjective continuation method to compute the regularization path of deep neural networks

2308.12044

Published 4/1/2024 by Augustina C. Amakor, Konstantin Sonntag, Sebastian Peitz

Abstract

Sparsity is a highly desired feature in deep neural networks (DNNs) since it ensures numerical efficiency, improves the interpretability of models (due to the smaller number of relevant features), and robustness. For linear models, it is well known that there exists a regularization path connecting the sparsest solution in terms of the $\ell^1$ norm, i.e., zero weights and the non-regularized solution. Very recently, there was a first attempt to extend the concept of regularization paths to DNNs by means of treating the empirical loss and sparsity ($\ell^1$ norm) as two conflicting criteria and solving the resulting multiobjective optimization problem for low-dimensional DNN. However, due to the non-smoothness of the $\ell^1$ norm and the high number of parameters, this approach is not very efficient from a computational perspective for high-dimensional DNNs. To overcome this limitation, we present an algorithm that allows for the approximation of the entire Pareto front for the above-mentioned objectives in a very efficient manner for high-dimensional DNNs with millions of parameters. We present numerical examples using both deterministic and stochastic gradients. We furthermore demonstrate that knowledge of the regularization path allows for a well-generalizing network parametrization. To the best of our knowledge, this is the first algorithm to compute the regularization path for non-convex multiobjective optimization problems (MOPs) with millions of degrees of freedom.

The paper presents an efficient algorithm for approximating the Pareto front in high-dimensional deep neural networks (DNNs) with millions of parameters. The algorithm treats the empirical loss and sparsity (measured by the $\ell^1$ norm of the weights) as conflicting objectives and solves the resulting multiobjective optimization problem. Sparsity is a desirable feature in DNNs because it improves numerical efficiency, model interpretability, and robustness.
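As a reading aid, the two-objective problem described above can be written out as follows. The notation is an assumption made for illustration and is not taken from the paper: $\theta$ denotes the network weights, $\mathcal{L}$ the empirical loss, and $n$ the number of parameters.

```latex
% Bi-objective problem: empirical loss versus sparsity (assumed notation)
\min_{\theta \in \mathbb{R}^n}
\begin{pmatrix}
  \mathcal{L}(\theta) \\
  \|\theta\|_1
\end{pmatrix},
\qquad
\|\theta\|_1 = \sum_{i=1}^{n} |\theta_i| .

% A point \theta^* is Pareto optimal if no \theta satisfies
%   \mathcal{L}(\theta) \le \mathcal{L}(\theta^*)  and  \|\theta\|_1 \le \|\theta^*\|_1
% with at least one of the two inequalities strict.
```

The Pareto front is the image of all Pareto optimal points under the two objectives. One end of it is the all-zero weight vector, and the other end corresponds to a minimizer of the unregularized loss.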

A previous attempt to extend the concept of regularization paths to DNNs also treated the empirical loss and sparsity as two conflicting criteria, but it was applied only to low-dimensional DNNs. Because of the non-smoothness of the $\ell^1$ norm and the large number of parameters, that approach is computationally inefficient for high-dimensional DNNs. The proposed algorithm overcomes this limitation and efficiently approximates the entire Pareto front for high-dimensional DNNs using both deterministic and stochastic gradients.
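To make the idea of a regularization path concrete, here is a minimal sketch for the linear case mentioned in the abstract. It is not the authors' continuation method: it uses the standard proximal gradient iteration (ISTA) with soft-thresholding, which is one common way to handle the non-smooth $\ell^1$ term, and sweeps a penalty weight from large to small with warm starts. The synthetic data, the function names, and the grid of penalty weights are all assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, w0, n_iter=500):
    """Proximal gradient descent for (0.5 / n) * ||Xw - y||^2 + lam * ||w||_1."""
    n = X.shape[0]
    # Step size 1/L, where L = ||X||_2^2 / n is the Lipschitz constant of the gradient.
    step = n / (np.linalg.norm(X, 2) ** 2)
    w = w0.copy()
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n          # gradient of the smooth loss term
        w = soft_threshold(w - step * grad, step * lam)
    return w

def regularization_path(X, y, n_points=10):
    """Trace (lam, loss, l1 norm, weights) from the all-zero solution toward the unregularized one."""
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / n     # smallest lam whose solution is all zeros
    lams = lam_max * np.logspace(0, -3, n_points)
    w = np.zeros(X.shape[1])
    path = []
    for lam in lams:
        w = ista(X, y, lam, w)                # warm start from the previous point on the path
        loss = 0.5 * np.mean((X @ w - y) ** 2)
        path.append((lam, loss, np.abs(w).sum(), w.copy()))
    return path

# Hypothetical synthetic regression problem with a sparse ground truth.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
w_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)

path = regularization_path(X, y)
for lam, loss, l1, _ in path:
    print(f"lam={lam:.4f}  loss={loss:.4f}  l1={l1:.4f}")
```

Each printed row is one point of the trade-off between loss and $\ell^1$ norm. For a convex linear model, sweeping the penalty weight in this way recovers the path; for non-convex DNNs a weighted sum does not in general reach every Pareto optimal point, which is one reason dedicated multiobjective methods are of interest.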

The authors demonstrate that knowledge of the regularization path allows for a well-generalizing network parametrization. This is the first algorithm to compute the regularization path for non-convex multiobjective optimization problems with millions of degrees of freedom.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Smoothing the Edges: Smooth Optimization for Sparse Regularization using Hadamard Overparametrization

Chris Kolb, Christian L. Müller, Bernd Bischl, David Rügamer

We present a framework for smooth optimization of explicitly regularized objectives for (structured) sparsity. These non-smooth and possibly non-convex problems typically rely on solvers tailored to specific models and regularizers. In contrast, our method enables fully differentiable and approximation-free optimization and is thus compatible with the ubiquitous gradient descent paradigm in deep learning. The proposed optimization transfer comprises an overparameterization of selected parameters and a change of penalties. In the overparametrized problem, smooth surrogate regularization induces non-smooth, sparse regularization in the base parametrization. We prove that the surrogate objective is equivalent in the sense that it not only has identical global minima but also matching local minima, thereby avoiding the introduction of spurious solutions. Additionally, our theory establishes results of independent interest regarding matching local minima for arbitrary, potentially unregularized, objectives. We comprehensively review sparsity-inducing parametrizations across different fields that are covered by our general theory, extend their scope, and propose improvements in several aspects. Numerical experiments further demonstrate the correctness and effectiveness of our approach on several sparse learning problems ranging from high-dimensional regression to sparse neural network training.

4/30/2024

cs.LG stat.ML

Deep learning from strongly mixing observations: Sparse-penalized regularization and minimax optimality

William Kengne, Modou Wade

The explicit regularization and optimality of deep neural network estimators from independent data have made considerable progress recently. The study of such properties on dependent data is still a challenge. In this paper, we carry out deep learning from strongly mixing observations, and deal with the squared and a broad class of loss functions. We consider sparse-penalized regularization for deep neural network predictors. For a general framework that includes regression estimation, classification, time series prediction, $\cdots$, an oracle inequality for the expected excess risk is established and a bound on the class of Hölder smooth functions is provided. For nonparametric regression from strongly mixing data with sub-exponential errors, we provide an oracle inequality for the $L_2$ error and investigate an upper bound of this error on a class of Hölder composition functions. For the specific case of nonparametric autoregression with Gaussian and Laplace errors, a lower bound of the $L_2$ error on this Hölder composition class is established. Up to a logarithmic factor, this bound matches its upper bound; so, the deep neural network estimator attains the minimax optimal rate.

6/13/2024

stat.ML cs.LG

Sparse deep neural networks for nonparametric estimation in high-dimensional sparse regression

Dongya Wu, Xin Li

Generalization theory has been established for sparse deep neural networks under high-dimensional regime. Beyond generalization, parameter estimation is also important since it is crucial for variable selection and interpretability of deep neural networks. Current theoretical studies concerning parameter estimation mainly focus on two-layer neural networks, which is due to the fact that the convergence of parameter estimation heavily relies on the regularity of the Hessian matrix, while the Hessian matrix of deep neural networks is highly singular. To avoid the unidentifiability of deep neural networks in parameter estimation, we propose to conduct nonparametric estimation of partial derivatives with respect to inputs. We first show that model convergence of sparse deep neural networks is guaranteed in that the sample complexity only grows with the logarithm of the number of parameters or the input dimension when the $\ell_{1}$-norm of parameters is well constrained. Then by bounding the norm and the divergence of partial derivatives, we establish that the convergence rate of nonparametric estimation of partial derivatives scales as $\mathcal{O}(n^{-1/4})$, a rate which is slower than the model convergence rate $\mathcal{O}(n^{-1/2})$. To the best of our knowledge, this study combines nonparametric estimation and parametric sparse deep neural networks for the first time. As nonparametric estimation of partial derivatives is of great significance for nonlinear variable selection, the current results show the promising future for the interpretability of deep neural networks.

6/27/2024

stat.ML cs.LG

Geometric sparsification in recurrent neural networks

Wyatt Mackey, Ioannis Schizas, Jared Deighton, David L. Boothe, Jr., Vasileios Maroulas

A common technique for ameliorating the computational costs of running large neural models is sparsification, or the removal of neural connections during training. Sparse models are capable of maintaining the high accuracy of state of the art models, while functioning at the cost of more parsimonious models. The structures which underlie sparse architectures are, however, poorly understood and not consistent between differently trained models and sparsification schemes. In this paper, we propose a new technique for sparsification of recurrent neural nets (RNNs), called moduli regularization, in combination with magnitude pruning. Moduli regularization leverages the dynamical system induced by the recurrent structure to induce a geometric relationship between neurons in the hidden state of the RNN. By making our regularizing term explicitly geometric, we provide the first, to our knowledge, a priori description of the desired sparse architecture of our neural net. We verify the effectiveness of our scheme for navigation and natural language processing RNNs. Navigation is a structurally geometric task, for which there are known moduli spaces, and we show that regularization can be used to reach 90% sparsity while maintaining model performance only when coefficients are chosen in accordance with a suitable moduli space. Natural language processing, however, has no known moduli space in which computations are performed. Nevertheless, we show that moduli regularization induces more stable recurrent neural nets with a variety of moduli regularizers, and achieves high fidelity models at 98% sparsity.

6/11/2024

cs.LG