Collaborating Authors: Unser, Michael


On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks

arXiv.org Artificial Intelligence

In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent initialized from zero. In this paper, we study a modification of the regularization path for infinite-width 2-layer ReLU neural networks with nonzero initial distribution of the weights at different scales. By exploiting a link with unbalanced optimal-transport theory, we show that, despite the non-convexity of the 2-layer network training, this problem admits an infinite-dimensional convex counterpart. We formulate the corresponding functional-optimization problem and investigate its main properties. In particular, we show that, as the scale of the initialization ranges between $0$ and $+\infty$, the associated path interpolates continuously between the so-called kernel and rich regimes. Numerical experiments confirm that, in our setting, the scaling path and the final states of the optimization path behave similarly, even beyond these extreme points.
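
To make the lifting concrete: an infinite-width 2-layer ReLU network can be written as an integral over neurons with respect to a measure $\mu$, and the map $\mu \mapsto f_\mu$ is linear, which is what opens the door to a convex reformulation. The display below is only a schematic of this standard construction (the loss $\ell$, the divergence $D_\tau$, and the initial neuron distribution $\mu_0$ at scale $\tau$ are placeholder notation, not the paper's exact functional):

$$ f_\mu(x) \;=\; \int a\,\max(\langle w, x\rangle, 0)\,\mathrm{d}\mu(w, a), \qquad \min_{\mu}\; \sum_{i=1}^{n} \ell\bigl(f_\mu(x_i), y_i\bigr) \;+\; \lambda\, D_\tau(\mu, \mu_0). $$

The regime reached at either end of the initialization scale is characterized precisely in the paper.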


Stability of Image-Reconstruction Algorithms

arXiv.org Artificial Intelligence

Robustness and stability of image-reconstruction algorithms have recently come under scrutiny. Their importance to medical imaging cannot be overstated. We review the known results for the topical variational regularization strategies ($\ell_2$ and $\ell_1$ regularization) and present novel stability results for $\ell_p$-regularized linear inverse problems for $p\in(1,\infty)$. Our results guarantee Lipschitz continuity for small $p$ and H\"{o}lder continuity for larger $p$. They generalize well to the $L_p(\Omega)$ function spaces.
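
As a quick way to see what such stability statements quantify, the following sketch (illustrative only; the matrix, the regularization weight, and the exponent are arbitrary choices, and the solver is an off-the-shelf quasi-Newton routine rather than anything from the paper) solves the $\ell_p$-regularized least-squares problem for perturbed data and reports the ratio between the change in the reconstruction and the size of the data perturbation:

    import numpy as np
    from scipy.optimize import minimize

    # Illustrative probe of solution stability for l_p-regularized least squares;
    # sizes, lambda, and p are arbitrary choices (not from the paper).
    rng = np.random.default_rng(1)
    m, n, lam, p = 20, 50, 0.1, 1.5
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    y = rng.standard_normal(m)

    def solve(y):
        def cost(x):
            r = A @ x - y
            return 0.5 * r @ r + lam * np.sum(np.abs(x) ** p)
        def grad(x):
            return A.T @ (A @ x - y) + lam * p * np.abs(x) ** (p - 1) * np.sign(x)
        return minimize(cost, np.zeros(n), jac=grad, method="L-BFGS-B").x

    x0 = solve(y)
    for eps in [1e-3, 1e-2, 1e-1]:
        delta = eps * rng.standard_normal(m)
        x1 = solve(y + delta)
        print(eps, np.linalg.norm(x1 - x0) / np.linalg.norm(delta))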


Asymptotic Stability in Reservoir Computing

arXiv.org Machine Learning

Reservoir Computing is a class of Recurrent Neural Networks with internal weights fixed at random. Stability relates to the sensitivity of the network state to perturbations. It is an important property in Reservoir Computing as it directly impacts performance. In practice, it is desirable to stay in a stable regime, where the effect of perturbations does not explode exponentially, but also close to the chaotic frontier where reservoir dynamics are rich. Open questions remain today regarding input regularization and discontinuous activation functions. In this work, we use the recurrent kernel limit to draw new insights on stability in reservoir computing. This limit corresponds to large reservoir sizes, and it already becomes relevant for reservoirs with a few hundred neurons. We obtain a quantitative characterization of the frontier between stability and chaos, which can greatly benefit hyperparameter tuning. In a broader sense, our results contribute to understanding the complex dynamics of Recurrent Neural Networks.
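
The standard empirical counterpart of this notion of stability is to run two copies of the same reservoir from nearby states and check whether their trajectories converge or separate. The sketch below (a generic echo-state-network recursion with tanh activations; the sizes, scalings, and the gain rho are illustrative assumptions, not the paper's setting) prints the final state discrepancy for a few recurrent gains:

    import numpy as np

    # Illustrative stability test for a random-weight reservoir (not the paper's code).
    rng = np.random.default_rng(0)
    N, T = 300, 200                                   # reservoir size, time steps
    W_in = 0.5 * rng.standard_normal(N)               # input weights
    u = rng.standard_normal(T)                        # shared input sequence

    def final_discrepancy(rho):
        W = rng.standard_normal((N, N)) / np.sqrt(N)  # random recurrent weights
        x_a = 0.1 * rng.standard_normal(N)
        x_b = x_a + 1e-6 * rng.standard_normal(N)     # tiny initial perturbation
        for t in range(T):
            x_a = np.tanh(rho * W @ x_a + W_in * u[t])
            x_b = np.tanh(rho * W @ x_b + W_in * u[t])
        return np.linalg.norm(x_a - x_b)

    for rho in [0.5, 1.0, 1.5, 2.0]:
        print(rho, final_discrepancy(rho))   # decays in the stable regime, not in the chaotic one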


Measuring Complexity of Learning Schemes Using Hessian-Schatten Total-Variation

arXiv.org Machine Learning

In this paper, we introduce the Hessian-Schatten total-variation (HTV), a novel seminorm that quantifies the total "rugosity" of multivariate functions. Our motivation for defining HTV is to assess the complexity of supervised learning schemes. We start by specifying the adequate matrix-valued Banach spaces that are equipped with suitable classes of mixed norms. We then show that HTV is invariant to rotations, scalings, and translations. Additionally, its minimum value is achieved for linear mappings, supporting the common intuition that linear regression is the least complex learning model. Next, we present closed-form expressions for computing the HTV of two general classes of functions. The first one is the class of Sobolev functions with a certain degree of regularity, for which we show that HTV coincides with the Hessian-Schatten seminorm that is sometimes used as a regularizer for image reconstruction. The second one is the class of continuous and piecewise-linear (CPWL) functions. In this case, we show that the HTV reflects the total change in slopes between linear regions that have a common facet. Hence, it can be viewed as a convex ($\ell_1$-type) relaxation of the number of linear regions ($\ell_0$-type) of CPWL mappings. Finally, we illustrate the use of our proposed seminorm with some concrete examples.
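
For intuition, a discrete stand-in for the Hessian-Schatten seminorm can be computed by summing the nuclear norm of a finite-difference Hessian over a grid. The sketch below (an illustrative approximation, not the paper's continuous-domain definition) shows that the value vanishes for an affine map and is positive for a creased or curved one:

    import numpy as np

    # Discrete stand-in for the Hessian-Schatten seminorm (illustrative only).
    def hessian_schatten_1(f, h=1.0):
        fxx = (f[2:, 1:-1] - 2 * f[1:-1, 1:-1] + f[:-2, 1:-1]) / h**2
        fyy = (f[1:-1, 2:] - 2 * f[1:-1, 1:-1] + f[1:-1, :-2]) / h**2
        fxy = (f[2:, 2:] - f[2:, :-2] - f[:-2, 2:] + f[:-2, :-2]) / (4 * h**2)
        total = 0.0
        for a, b, c in zip(fxx.ravel(), fyy.ravel(), fxy.ravel()):
            H = np.array([[a, c], [c, b]])
            total += np.abs(np.linalg.eigvalsh(H)).sum()  # nuclear norm of a symmetric 2x2 Hessian
        return total

    x, y = np.meshgrid(np.linspace(-1, 1, 64), np.linspace(-1, 1, 64), indexing="ij")
    print(hessian_schatten_1(2 * x - 3 * y + 1))   # affine map: essentially zero
    print(hessian_schatten_1(np.abs(x)))           # CPWL map with one crease: positive
    print(hessian_schatten_1(x**2 + y**2))         # smooth curved map: positive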


Native Banach spaces for splines and variational inverse problems

arXiv.org Machine Learning

We propose a systematic construction of native Banach spaces for general spline-admissible operators ${\rm L}$. In short, the native space for ${\rm L}$ and the (dual) norm $\|\cdot\|_{\mathcal{X}'}$ is the largest space of functions $f: \mathbb{R}^d \to \mathbb{R}$ such that $\|{\rm L} f\|_{\mathcal{X}'}<\infty$, subject to the constraint that the growth-restricted null space of ${\rm L}$ be finite-dimensional. This space, denoted by $\mathcal{X}'_{\rm L}$, is specified as the dual of the pre-native space $\mathcal{X}_{\rm L}$, which is itself obtained through a suitable completion process. The main difference with prior constructions (e.g., reproducing kernel Hilbert spaces) is that our approach involves test functions rather than sums of atoms (e.g., kernels), which makes it applicable to a much broader class of norms, including total variation. Under specific admissibility and compatibility hypotheses, we lay out the direct-sum topology of $\mathcal{X}_{\rm L}$ and $\mathcal{X}'_{\rm L}$, and identify the whole family of equivalent norms. Our construction ensures that the native space and its pre-dual are endowed with a fundamental Schwartz-Banach property. In practical terms, this means that $\mathcal{X}'_{\rm L}$ is rich enough to reproduce any function with an arbitrary degree of precision.
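
Stated informally, the defining condition on the native space is

$$ \mathcal{X}'_{\rm L} \;=\; \bigl\{ f: \mathbb{R}^d \to \mathbb{R} \;:\; \|{\rm L} f\|_{\mathcal{X}'} < \infty \bigr\}, $$

understood together with the requirement that the growth-restricted null space of ${\rm L}$ be finite-dimensional; the rigorous construction proceeds by duality from the pre-native space $\mathcal{X}_{\rm L}$.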


A unifying representer theorem for inverse problems and machine learning

arXiv.org Machine Learning

The standard approach for dealing with the ill-posedness of the training problem in machine learning and/or the reconstruction of a signal from a limited number of measurements is regularization. The method is applicable whenever the problem is formulated as an optimization task. The standard strategy consists in augmenting the original cost functional by an energy that penalizes solutions with undesirable behavior. The effect of regularization is very well understood when the penalty involves a Hilbertian norm. Another popular configuration is the use of an $\ell_1$-norm (or some variant thereof) that favors sparse solutions. In this paper, we propose a higher-level formulation of regularization within the context of Banach spaces. We present a general representer theorem that characterizes the solutions of a remarkably broad class of optimization problems. We then use our theorem to retrieve a number of known results in the literature (e.g., the celebrated representer theorem of machine learning for RKHS, Tikhonov regularization, representer theorems for sparsity-promoting functionals, the recovery of spikes), as well as a few new ones.
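
Schematically, and with notation assumed here for illustration rather than quoted from the paper, the results concern problems of the form

$$ \min_{f \in \mathcal{X}} \; \sum_{m=1}^{M} E\bigl(y_m, \langle \nu_m, f \rangle\bigr) \;+\; \psi\bigl(\|f\|_{\mathcal{X}}\bigr), $$

where the $\nu_m$ are linear measurement functionals, $E$ is a data-fidelity term, and $\psi$ is an increasing function of the Banach norm; the representer theorem then describes solutions as finite linear combinations of extremal atoms of the regularization ball, with the number of atoms tied to the number $M$ of measurements. The precise hypotheses and the exact form of the statement are given in the paper.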


An L1 Representer Theorem for Multiple-Kernel Regression

arXiv.org Machine Learning

The theory of RKHS provides an elegant framework for supervised learning. It is the foundation of all kernel methods in machine learning. Implicit in its formulation is the use of a quadratic regularizer associated with the underlying inner product which imposes smoothness constraints. In this paper, we consider instead the generalized total-variation (gTV) norm as the sparsity-promoting regularizer. This leads us to propose a new Banach-space framework that justifies the use of generalized LASSO, albeit in a slightly modified version. We prove a representer theorem for multiple-kernel regression (MKR) with gTV regularization. The theorem states that the solutions of MKR have kernel expansions with adaptive positions, while the gTV norm enforces an $\ell_1$ penalty on the coefficients. We discuss the sparsity-promoting effect of the gTV norm which prevents redundancy in the multiple-kernel scenario.
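
To see the mechanism on a finite problem, the sketch below (an illustrative stand-in, not the paper's algorithm or its continuous-domain gTV norm) builds a dictionary from several kernels centered at the data points and fits the expansion coefficients with an $\ell_1$ penalty via plain ISTA, so that only a few (kernel, position) atoms remain active:

    import numpy as np

    # Finite-dimensional stand-in for multiple-kernel regression with an l1 penalty
    # (illustrative only; not the paper's algorithm).
    rng = np.random.default_rng(0)
    n = 60
    x = np.sort(rng.uniform(-3, 3, n))
    y = np.sinc(x) + 0.05 * rng.standard_normal(n)

    kernels = [lambda r: np.exp(-r**2 / (2 * 0.3**2)),   # narrow Gaussian
               lambda r: np.exp(-r**2 / (2 * 1.0**2)),   # wide Gaussian
               lambda r: np.exp(-np.abs(r) / 0.5)]       # Laplacian
    D = np.hstack([k(x[:, None] - x[None, :]) for k in kernels])   # (n, 3n) dictionary

    lam = 0.05
    L = np.linalg.norm(D, 2) ** 2                 # Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(2000):                         # ISTA for 0.5*||Dc - y||^2 + lam*||c||_1
        c = c - D.T @ (D @ c - y) / L
        c = np.sign(c) * np.maximum(np.abs(c) - lam / L, 0.0)

    print("active atoms:", np.count_nonzero(np.abs(c) > 1e-8), "out of", c.size)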


A representer theorem for deep neural networks

arXiv.org Machine Learning

We propose to optimize the activation functions of a deep neural network by adding a corresponding functional regularization to the cost function. We justify the use of a second-order total-variation criterion. This allows us to derive a general representer theorem for deep neural networks that makes a direct connection with splines and sparsity. Specifically, we show that the optimal network configuration can be achieved with activation functions that are nonuniform linear splines with adaptive knots. The bottom line is that the action of each neuron is encoded by a spline whose parameters (including the number of knots) are optimized during the training procedure. The scheme results in a computational structure that is compatible with the existing deep-ReLU and MaxOut architectures. It also suggests novel optimization challenges, while making the link with $\ell_1$ minimization and sparsity-promoting techniques explicit.
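
The core object is an activation parameterized as a linear term plus a sum of shifted ReLUs, whose second-order total variation is simply the $\ell_1$ norm of the ReLU coefficients. The snippet below is a minimal one-activation sketch of that parameterization (the knot grid and names are illustrative choices, not the paper's implementation):

    import numpy as np

    # One-activation sketch of a spline (CPWL) activation with candidate knots
    # (illustrative parameterization, not the paper's implementation).
    knots = np.linspace(-3, 3, 31)                      # candidate knot locations

    def activation(t, b, w, a):
        # sigma(t) = b + w*t + sum_k a_k * max(t - knot_k, 0): a CPWL function
        return b + w * t + np.maximum(t[:, None] - knots[None, :], 0.0) @ a

    def second_order_tv(a):
        return np.sum(np.abs(a))                        # sum of the slope changes at the knots

    t = np.linspace(-4, 4, 200)
    a = np.zeros(knots.size)
    a[15] = 1.0                                         # one active knot at 0
    print(np.allclose(activation(t, 0.0, 0.0, a), np.maximum(t, 0.0)))  # reduces to a plain ReLU
    print(second_order_tv(a))                           # TV(2) = 1: a single unit jump in slope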


Learning Convex Regularizers for Optimal Bayesian Denoising

arXiv.org Machine Learning

We propose a data-driven algorithm for the maximum a posteriori (MAP) estimation of stochastic processes from noisy observations. The primary statistical properties of the sought signal are specified by the penalty function (i.e., the negative logarithm of the prior probability density function). Our alternating direction method of multipliers (ADMM)-based approach translates the estimation task into successive applications of the proximal mapping of the penalty function. Capitalizing on this direct link, we define the proximal operator as a parametric spline curve and optimize the spline coefficients by minimizing the average reconstruction error for a given training set. The key aspects of our learning method are that the associated penalty function is constrained to be convex and that the convergence of the ADMM iterations is proven. As a result of these theoretical guarantees, the adaptation of the proposed framework to different levels of measurement noise is extremely simple and does not require any retraining. We apply our method to the estimation of both sparse and non-sparse models of L\'{e}vy processes for which the minimum mean-square error (MMSE) estimators are available. We carry out a single training session and perform comparisons at various signal-to-noise ratio (SNR) values. Simulations illustrate that the performance of our algorithm is practically identical to that of the MMSE estimator, irrespective of the noise power.
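
For orientation, the sketch below spells out the kind of ADMM loop in which such a learned proximal curve would sit: the prior enters only through the pointwise function prox_curve, which in the paper is a trained spline constrained to correspond to a convex penalty and which is replaced here by a soft-thresholding placeholder. The forward model, signal model, and parameters are illustrative assumptions:

    import numpy as np

    # Schematic ADMM denoiser in which the prior enters only through a pointwise
    # proximal curve (a placeholder here; a trained convex-penalty spline in the paper).
    def prox_curve(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)   # stand-in for the learned prox

    def admm_denoise(y, lam=0.5, rho=1.0, iters=200):
        n = len(y)
        L = np.eye(n) - np.eye(n, k=-1)                  # finite-difference (increment) operator
        Q = np.linalg.inv(np.eye(n) + rho * L.T @ L)     # x-update matrix (identity forward model)
        x, z, u = y.copy(), L @ y, np.zeros(n)
        for _ in range(iters):
            x = Q @ (y + rho * L.T @ (z - u))            # quadratic x-update
            z = prox_curve(L @ x + u, lam / rho)         # proximal step on the increments
            u = u + L @ x - z                            # dual update
        return x

    rng = np.random.default_rng(0)
    signal = np.cumsum(rng.standard_normal(200) * (rng.random(200) < 0.05))  # sparse-increment signal
    y = signal + 0.1 * rng.standard_normal(200)
    print(np.linalg.norm(admm_denoise(y) - signal) / np.linalg.norm(y - signal))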


Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

Neural Information Processing Systems

We consider the estimation of an i.i.d. vector $\mathbf{x} \in \mathbb{R}^n$ from measurements $\mathbf{y} \in \mathbb{R}^m$ obtained by a general cascade model consisting of a known linear transform followed by a probabilistic componentwise (possibly nonlinear) measurement channel. We present a method, called adaptive generalized approximate message passing (Adaptive GAMP), that enables joint learning of the statistics of the prior and measurement channel along with estimation of the unknown vector $\mathbf{x}$. The proposed algorithm is a generalization of a recently developed method by Vila and Schniter that uses expectation-maximization (EM) iterations where the posteriors in the E-steps are computed via approximate message passing. The techniques can be applied to a large class of learning problems, including the learning of sparse priors in compressed sensing or the identification of linear-nonlinear cascade models in dynamical systems and neural spiking processes. We prove that, for large i.i.d. Gaussian transform matrices, the asymptotic componentwise behavior of the adaptive GAMP algorithm is predicted by a simple set of scalar state-evolution equations. This analysis shows that the adaptive GAMP method can yield asymptotically consistent parameter estimates, which implies that the algorithm achieves a reconstruction quality equivalent to the oracle algorithm that knows the correct parameter values. The adaptive GAMP methodology thus provides a systematic, general, and computationally efficient method applicable to a large range of complex linear-nonlinear models with provable guarantees.
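
To make the structure of such message-passing recursions concrete, the sketch below runs the plain AMP iteration with a soft-thresholding denoiser, which is the classical special case that GAMP and adaptive GAMP generalize; the EM-style parameter adaptation of the paper is not reproduced and is replaced by a fixed threshold multiplier. Sizes and noise level are illustrative:

    import numpy as np

    # Plain AMP with a soft-thresholding denoiser (the classical special case that
    # GAMP and adaptive GAMP generalize); illustrative sizes, no EM parameter learning.
    rng = np.random.default_rng(0)
    n, m, k = 500, 250, 25
    A = rng.standard_normal((m, n)) / np.sqrt(m)        # i.i.d. Gaussian transform
    x_true = np.zeros(n)
    x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
    y = A @ x_true + 0.01 * rng.standard_normal(m)

    def soft(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    x, z = np.zeros(n), y.copy()
    for _ in range(30):
        tau = 2.0 * np.sqrt(np.mean(z**2))              # estimate of the effective noise level
        x_new = soft(x + A.T @ z, tau)                  # componentwise denoising step
        z = y - A @ x_new + z * np.count_nonzero(x_new) / m   # residual with Onsager correction
        x = x_new

    print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))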