Balestriero, Randall
The Recurrent Neural Tangent Kernel
Alemohammad, Sina, Wang, Zichao, Balestriero, Randall, Baraniuk, Richard
The study of deep networks (DNs) in the infinite-width limit, via the so-called Neural Tangent Kernel (NTK) approach, has provided new insights into the dynamics of learning, generalization, and the impact of initialization. One key DN architecture remains to be kernelized, namely, the Recurrent Neural Network (RNN). In this paper we introduce and study the Recurrent Neural Tangent Kernel (RNTK), which sheds new insights into the behavior of overparametrized RNNs, including how different time steps are weighted by the RNTK to form the output under different initialization parameters and nonlinearity choices, and how inputs of different lengths are treated. We demonstrate via a number of experiments that the RNTK offers significant performance gains over other kernels, including standard NTKs across a range of different data sets. A unique benefit of the RNTK is that it is agnostic to the length of the input, in stark contrast to other kernels.
Provable Finite Data Generalization with Group Autoencoder
Cosentino, Romain, Balestriero, Randall, Baraniuk, Richard, Aazhang, Behnaam
Deep Autoencoders (AEs) provide a versatile framework to learn a compressed, interpretable, or structured representation of data. As such, AEs have been used extensively for denoising, compression, data completion as well as pre-training of Deep Networks (DNs) for various tasks such as classification. By providing a careful analysis of current AEs from a spline perspective, we can interpret the input-output mapping, in turn allowing us to derive conditions for generalization and reconstruction guarantee. By assuming a Lie group structure on the data at hand, we are able to derive a novel regularization of AEs, allowing for the first time to ensure the generalization of AEs in the finite training set case. We validate our theoretical analysis by demonstrating how this regularization significantly increases the generalization of the AE on various datasets.
Ensembles of Generative Adversarial Networks for Disconnected Data
Luzi, Lorenzo, Balestriero, Randall, Baraniuk, Richard G.
Most current computer vision datasets are composed of disconnected sets, such as images from different classes. We prove that distributions of this type of data cannot be represented with a continuous generative network without error. They can be represented in two ways: With an ensemble of networks or with a single network with truncated latent space. We show that ensembles are more desirable than truncated distributions in practice. We construct a regularized optimization problem that establishes the relationship between a single continuous GAN, an ensemble of GANs, conditional GANs, and Gaussian Mixture GANs. This regularization can be computed efficiently, and we show empirically that our framework has a performance sweet spot which can be found with hyperparameter tuning. This ensemble framework allows better performance than a single continuous GAN or cGAN while maintaining fewer total parameters.
Analytical Probability Distributions and EM-Learning for Deep Generative Networks
Balestriero, Randall, Paris, Sebastien, Baraniuk, Richard G.
Deep Generative Networks (DGNs) with probabilistic modeling of their output and latent space are currently trained via Variational Autoencoders (VAEs). In the absence of a known analytical form for the posterior and likelihood expectation, VAEs resort to approximations, including (Amortized) Variational Inference (AVI) and Monte-Carlo (MC) sampling. We exploit the Continuous Piecewise Affine (CPA) property of modern DGNs to derive their posterior and marginal distributions as well as the latter's first moments. These findings enable us to derive an analytical Expectation-Maximization (EM) algorithm that enables gradient-free DGN learning. We demonstrate empirically that EM training of DGNs produces greater likelihood than VAE training. Our findings will guide the design of new VAE AVI that better approximate the true posterior and open avenues to appply standard statistical tools for model comparison, anomaly detection, and missing data imputation.
A Hessian Based Complexity Measure for Deep Networks
Javadi, Hamid, Balestriero, Randall, Baraniuk, Richard
Deep (neural) networks have been applied productively in a wide range of supervised and unsupervised learning tasks. Unlike classical machine learning algorithms, deep networks typically operate in the overparameterized regime, where the number of parameters is larger than the number of training data points. Consequently, understanding the generalization properties and role of (explicit or implicit) regularization in these networks is of great importance. Inspired by the seminal work of Donoho and Grimes in manifold learning, we develop a new measure for the complexity of the function generated by a deep network based on the integral of the norm of the tangent Hessian. This complexity measure can be used to quantify the irregularity of the function a deep network fits to training data or as a regularization penalty for deep network learning. Indeed, we show that the oft-used heuristic of data augmentation imposes an implicit Hessian regularization during learning. We demonstrate the utility of our new complexity measure through a range of learning experiments.
The Geometry of Deep Networks: Power Diagram Subdivision
Balestriero, Randall, Cosentino, Romain, Aazhang, Behnaam, Baraniuk, Richard
We study the geometry of deep (neural) networks (DNs) with piecewise affine and convex nonlinearities. The layers of such DNs have been shown to be {\em max-affine spline operators} (MASOs) that partition their input space and apply a region-dependent affine mapping to their input to produce their output. We demonstrate that each MASO layer's input space partitioning corresponds to a {\em power diagram} (an extension of the classical Voronoi tiling) with a number of regions that grows exponentially with respect to the number of units (neurons). We further show that a composition of MASO layers (e.g., the entire DN) produces a progressively subdivided power diagram and provide its analytical form. The subdivision process constrains the affine maps on the (exponentially many) power diagram regions to greatly reduce their complexity. For classification problems, we obtain a formula for a MASO DN's decision boundary in the input space plus a measure of its curvature that depends on the DN's nonlinearities, weights, and architecture. Numerous numerical experiments support and extend our theoretical results.
From Hard to Soft: Understanding Deep Network Nonlinearities via Vector Quantization and Statistical Inference
Balestriero, Randall, Baraniuk, Richard G.
Nonlinearity is crucial to the performance of a deep (neural) network (DN). To date there has been little progress understanding the menagerie of available nonlinearities, but recently progress has been made on understanding the r\^ole played by piecewise affine and convex nonlinearities like the ReLU and absolute value activation functions and max-pooling. In particular, DN layers constructed from these operations can be interpreted as {\em max-affine spline operators} (MASOs) that have an elegant link to vector quantization (VQ) and $K$-means. While this is good theoretical progress, the entire MASO approach is predicated on the requirement that the nonlinearities be piecewise affine and convex, which precludes important activation functions like the sigmoid, hyperbolic tangent, and softmax. {\em This paper extends the MASO framework to these and an infinitely large class of new nonlinearities by linking deterministic MASOs with probabilistic Gaussian Mixture Models (GMMs).} We show that, under a GMM, piecewise affine, convex nonlinearities like ReLU, absolute value, and max-pooling can be interpreted as solutions to certain natural "hard" VQ inference problems, while sigmoid, hyperbolic tangent, and softmax can be interpreted as solutions to corresponding "soft" VQ inference problems. We further extend the framework by hybridizing the hard and soft VQ optimizations to create a $\beta$-VQ inference that interpolates between hard, soft, and linear VQ inference. A prime example of a $\beta$-VQ DN nonlinearity is the {\em swish} nonlinearity, which offers state-of-the-art performance in a range of computer vision tasks but was developed ad hoc by experimentation. Finally, we validate with experiments an important assertion of our theory, namely that DN performance can be significantly improved by enforcing orthogonality in its linear filters.
A Spline Theory of Deep Networks (Extended Version)
Balestriero, Randall, Baraniuk, Richard
We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal with each other; this leads to significantly improved classifi- cation performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up new geometric avenue to study how DNs organize signals in a hierarchical fashion. To validate the utility of the VQ interpretation, we develop and validate a new distance metric for signals and images that quantifies the difference between their VQ encodings. (This paper is a significantly expanded version of a paper with the same title that will appear at ICML 2018.)
Semi-Supervised Learning Enabled by Multiscale Deep Neural Network Inversion
Balestriero, Randall, Glotin, Herve, Baraniuk, Richard
Deep Neural Networks (DNNs) provide state-of-the-art solutions in several difficult machine perceptual tasks. However, their performance relies on the availability of a large set of labeled training data, which limits the breadth of their applicability. Hence, there is a need for new {\em semi-supervised learning} methods for DNNs that can leverage both (a small amount of) labeled and unlabeled training data. In this paper, we develop a general loss function enabling DNNs of any topology to be trained in a semi-supervised manner without extra hyper-parameters. As opposed to current semi-supervised techniques based on topology-specific or unstable approaches, ours is both robust and general. We demonstrate that our approach reaches state-of-the-art performance on the SVHN ($9.82\%$ test error, with $500$ labels and wide Resnet) and CIFAR10 (16.38% test error, with 8000 labels and sigmoid convolutional neural network) data sets.
Overcomplete Frame Thresholding for Acoustic Scene Analysis
Cosentino, Romain, Balestriero, Randall, Baraniuk, Richard, Patel, Ankit
In this work, we derive a generic overcomplete frame thresholding scheme based on risk minimization. Overcomplete frames being favored for analysis tasks such as classification, regression or anomaly detection, we provide a way to leverage those optimal representations in real-world applications through the use of thresholding. We validate the method on a large scale bird activity detection task via the scattering network architecture performed by means of continuous wavelets, known for being an adequate dictionary in audio environments.