Country
On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width
The Hessian of neural networks can be decomposed into a sum of two matrices: (i) the positive semidefinite generalized Gauss-Newton matrix G, and (ii) the matrix H containing negative eigenvalues. We observe that for wider networks, minimizing the loss with the gradient descent optimization maneuvers through surfaces of positive curvatures at the start and end of training, and close to zero curvatures in between. In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G. To explain this phenomenon, we show that when initialized using common methodologies, the gradients of over-parameterized networks are approximately orthogonal to H, such that the curvature of the loss surface is strictly positive in the direction of the gradient.
Disentanglement by Nonlinear ICA with General Incompressible-flow Networks (GIN)
Sorrenson, Peter, Rother, Carsten, Köthe, Ullrich
A central question of representation learning asks under which conditions it is possible to reconstruct the true latent variables of an arbitrarily complex generative process. Recent breakthrough work by Khemakhem et al. (2019) on nonlinear ICA has answered this question for a broad class of conditional generative processes. We extend this important result in a direction relevant for application to real-world data. First, we generalize the theory to the case of unknown intrinsic problem dimension and prove that in some special (but not very restrictive) cases, informative latent variables will be automatically separated from noise by an estimating model. Furthermore, the recovered informative latent variables will be in one-to-one correspondence with the true latent variables of the generating process, up to a trivial component-wise transformation. Second, we introduce a modification of the RealNVP invertible neural network architecture (Dinh et al., 2016) which is particularly suitable for this type of problem: the General Incompressible-flow Network (GIN). Experiments on artificial data and EMNIST demonstrate that theoretical predictions are indeed verified in practice. In particular, we provide a detailed set of exactly 22 informative latent variables extracted from EMNIST. Deep latent-variable models promise to unlock the key factors of variation within a dataset, opening a window to interpretation and granting the power to manipulate data in an intuitive fashion. The theory of identifiability in linear independent component analysis (ICA) (Comon, 1994) tells us when this is possible, if we restrict the model to a linear transformation, but until recently there was no corresponding theory for the highly nonlinear models needed to manipulate complex data. This changed with the recent breakthrough work by Khemakhem et al. (2019), which showed that under relatively mild conditions, it is possible to recover the joint data and latent space distribution, up to a simple transformation in the latent space. The key requirement is that the generating process is conditioned on a variable which is observed along with the data. This condition could be a class label, time index of a time series, or any other piece of information additional to the data.
Quantisation and Pruning for Neural Network Compression and Regularisation
Paupamah, Kimessha, James, Steven, Klein, Richard
Deep neural networks are typically too computationally expensive to run in real-time on consumer-grade hardware and low-powered devices. In this paper, we investigate reducing the computational and memory requirements of neural networks through network pruning and quantisation. We examine their efficacy on large networks like AlexNet compared to recent compact architectures: ShuffleNet and MobileNet. Our results show that pruning and quantisation compresses these networks to less than half their original size and improves their efficiency, particularly on MobileNet with a 7x speedup. We also demonstrate that pruning, in addition to reducing the number of parameters in a network, can aid in the correction of overfitting.
Distributed Learning in the Non-Convex World: From Batch to Streaming Data, and Beyond
Chang, Tsung-Hui, Hong, Mingyi, Wai, Hoi-To, Zhang, Xinwei, Lu, Songtao
Distributed learning has become a critical enabler of the massively connected world envisioned by many. This article discusses four key elements of scalable distributed processing and real-time intelligence -- problems, data, communication and computation. Our aim is to provide a fresh and unique perspective about how these elements should work together in an effective and coherent manner. In particular, we provide a selective review about the recent techniques developed for optimizing non-convex models (i.e., problem classes), processing batch and streaming data (i.e., data types), over the networks in a distributed manner (i.e., communication and computation paradigm). We describe the intuitions and connections behind a core set of popular distributed algorithms, emphasizing how to trade off between computation and communication costs. Practical issues and future research directions will also be discussed. We are living in a highly connected world, and it will become exponentially more connected in a decade. These devices collect a huge amount of real-time data, perform complex computational tasks, and provide vital services which significantly improve our lives and enrich our collective productivity. THC, MH, HTW are ordered alphabetically, and contributed equally. MH is the corresponding author. THC is with the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China. MH and XZ are with the ECE Department, University of Minnesota, MN, USA. HTW is with the Department of SEEM, The Chinese University of Hong Kong, Hong Kong SAR, China. SL is with IBM Research AI, IBM Thomas J. Watson Research Center Y orktown Heights, New Y ork 10598, USA.
Cram\'er-Rao Lower Bounds Arising from Generalized Csisz\'ar Divergences
Kumar, M. Ashok, Mishra, Kumar Vijay
We study the geometry of probability distributions with respect to a generalized family of Csisz\'ar $f$-divergences. A member of this family is the relative $\alpha$-entropy which is also a R\'enyi analog of relative entropy in information theory and known as logarithmic or projective power divergence in statistics. We apply Eguchi's theory to derive the Fisher information metric and the dual affine connections arising from these generalized divergence functions. The notion enables us to arrive at a more widely applicable version of the Cram\'{e}r-Rao inequality, which provides a lower bound for the variance of an estimator for an escort of the underlying parametric probability distribution. We then extend the Amari-Nagaoka's dually flat structure of the exponential and mixer models to other distributions with respect to the aforementioned generalized metric. We show that these formulations lead us to find unbiased and efficient estimators for the escort model.
Adversarial Disentanglement with Grouped Observations
We consider the disentanglement of the representations of the relevant attributes of the data (content) from all other factors of variations (style) using Variational Autoencoders. Some recent works addressed this problem by utilizing grouped observations, where the content attributes are assumed to be common within each group, while there is no any supervised information on the style factors. In many cases, however, these methods fail to prevent the models from using the style variables to encode content related features as well. This work supplements these algorithms with a method that eliminates the content information in the style representations. For that purpose the training objective is augmented to minimize an appropriately defined mutual information term in an adversarial way. Experimental results and comparisons on image datasets show that the resulting method can efficiently separate the content and style related attributes and generalizes to unseen data.
Learning Overlapping Representations for the Estimation of Individualized Treatment Effects
Zhang, Yao, Bellot, Alexis, van der Schaar, Mihaela
The choice of making an intervention depends on its potential benefit or harm in comparison to alternatives. Estimating the likely outcome of alternatives from observational data is a challenging problem as all outcomes are never observed, and selection bias precludes the direct comparison of differently intervened groups. Despite their empirical success, we show that algorithms that learn domain-invariant representations of inputs (on which to make predictions) are often inappropriate, and develop generalization bounds that demonstrate the dependence on domain overlap and highlight the need for invertible latent maps. Based on these results, we develop a deep kernel regression algorithm and posterior regularization framework that substantially outperforms the state-of-the-art on a variety of benchmarks data sets.
HumBug Zooniverse: a crowd-sourced acoustic mosquito dataset
Kiskin, Ivan, Cobb, Adam D., Wang, Lawrence, Roberts, Stephen
Mosquitoes are the only known vector of malaria, which leads to hundreds of thousands of deaths each year. Understanding the number and location of potential mosquito vectors is of paramount importance to aid the reduction of malaria transmission cases. In recent years, deep learning has become widely used for bioacoustic classification tasks. In order to enable further research applications in this field, we release a new dataset of mosquito audio recordings. With over a thousand contributors, we obtained 195,434 labels of two second duration, of which approximately 10 percent signify mosquito events. We present an example use of the dataset, in which we train a convolutional neural network on log-Mel features, showcasing the information content of the labels. We hope this will become a vital resource for those researching all aspects of malaria, and add to the existing audio datasets for bioacoustic detection and signal processing.
Hydra: Preserving Ensemble Diversity for Model Distillation
Tran, Linh, Veeling, Bastiaan S., Roth, Kevin, Swiatkowski, Jakub, Dillon, Joshua V., Snoek, Jasper, Mandt, Stephan, Salimans, Tim, Nowozin, Sebastian, Jenatton, Rodolphe
Ensembles of models have been empirically shown to improve predictive performance and to yield robust measures of uncertainty. However, they are expensive in computation and memory. Therefore, recent research has focused on distilling ensembles into a single compact model, reducing the computational and memory burden of the ensemble while trying to preserve its predictive behavior. Most existing distillation formulations summarize the ensemble by capturing its average predictions. As a result, the diversity of the ensemble predictions, stemming from each individual member, is lost. Thus, the distilled model cannot provide a measure of uncertainty comparable to that of the original ensemble. To retain more faithfully the diversity of the ensemble, we propose a distillation method based on a single multi-headed neural network, which we refer to as Hydra. The shared body network learns a joint feature representation that enables each head to capture the predictive behavior of each ensemble member. We demonstrate that with a slight increase in parameter count, Hydra improves distillation performance on classification and regression settings while capturing the uncertainty behaviour of the original ensemble over both in-domain and out-of-distribution tasks.
Deep Learning for ECG Segmentation
Moskalenko, Viktor, Zolotykh, Nikolai, Osipov, Grigory
We propose an algorithm for electrocardiogram (ECG) segmentation using a UNet-like full-convolutional neural network. The algorithm receives an arbitrary sampling rate ECG signal as an input, and gives a list of onsets and offsets of P and T waves and QRS complexes as output. Our method of segmentation differs from others in speed, a small number of parameters and a good generalization: it is adaptive to different sampling rates and it is generalized to various types of ECG monitors. The proposed approach is superior to other state-of-the-art segmentation methods in terms of quality. In particular, F1-measures for detection of onsets and offsets of P and T waves and for QRS-complexes are at least 97.8%, 99.5%, and 99.9%, respectively.