Country
Computationally Efficient Neural Image Compression
Johnston, Nick, Eban, Elad, Gordon, Ariel, Ballé, Johannes
Image compression using neural networks have reached or exceeded non-neural methods (such as JPEG, WebP, BPG). While these networks are state of the art in ratedistortion performance, computational feasibility of these models remains a challenge. We apply automatic network optimization techniques to reduce the computational complexity of a popular architecture used in neural image compression, analyze the decoder complexity in execution runtime and explore the trade-offs between two distortion metrics, rate-distortion performance and run-time performance to design and research more computationally efficient neural image compression. We find that our method decreases the decoder run-time requirements by over 50% for a stateof-the-art neural architecture.
RealMix: Towards Realistic Semi-Supervised Deep Learning Algorithms
Nair, Varun, Alonso, Javier Fuentes, Beltramelli, Tony
Semi-Supervised Learning (SSL) algorithms have shown great potential in training regimes when access to labeled data is scarce but access to unlabeled data is plentiful. However, our experiments illustrate several shortcomings that prior SSL algorithms suffer from. In particular, poor performance when unlabeled and labeled data distributions differ. To address these observations, we develop RealMix, which achieves state-of-the-art results on standard benchmark datasets across different labeled and unlabeled set sizes while overcoming the aforementioned challenges. Notably, RealMix achieves an error rate of 9.79% on CIFAR10 with 250 labels and is the only SSL method tested able to surpass baseline performance when there is significant mismatch in the labeled and unlabeled data distributions. RealMix demonstrates how SSL can be used in real world situations with limited access to both data and compute and guides further research in SSL with practical applicability in mind.
Interleaved Composite Quantization for High-Dimensional Similarity Search
Khoram, Soroosh, Wright, Stephen J, Li, Jing
As the size of the dataset grows, the cost of performing the distance computations needed to implement a query can become prohibitive. A method often used to reduce this computational cost is quantization of the vector space and location-based encoding of the dataset vectors. These encodings can be used during query processing to find approximate nearest neighbors of the query point quickly. Search speed can be improved by using shorter codes, but shorter codes have higher quantization error, leading to degraded precision. In this work, we propose the Interleaved Composite Quantization (ICQ) which achieves fast similarity search without using shorter codes. In ICQ, a small subset of the code is used to approximate the distances, with complete codes being used only when necessary. Our method effectively reduces both code length and quantization error . Furthermore, ICQ is compatible with several recently proposed techniques for reducing quantization error and can be used in conjunction with these other techniques to improve results. W e confirm these claims and show strong empirical performance of ICQ using several synthetic and real-word datasets.
Clusters in Explanation Space: Inferring disease subtypes from model explanations
Schulz, Marc-Andre, Chapman-Rounds, Matt, Verma, Manisha, Bzdok, Danilo, Georgatzis, Konstantinos
Identification of disease subtypes and corresponding biomarkers can substantially improve clinical diagnosis and treatment selection. Discovering these subtypes in noisy, high dimensional biomedical data is often impossible for humans and challenging for machines. We introduce a new approach to facilitate the discovery of disease subtypes: Instead of analyzing the original data, we train a diagnostic classifier (healthy vs. diseased) and extract instance-wise explanations for the classifier's decisions. The distribution of instances in the explanation space of our diagnostic classifier amplifies the different reasons for belonging to the same class - resulting in a representation that is uniquely useful for discovering latent subtypes. We compare our ability to recover subtypes via cluster analysis on model explanations to classical cluster analysis on the original data. In multiple datasets with known ground-truth subclasses, most compellingly on UK Biobank brain imaging data and transcriptome data from the Cancer Genome Atlas, we show that cluster analysis on model explanations substantially outperforms the classical approach. While we believe clustering in explanation space to be particularly valuable for inferring disease subtypes, the method is more general and applicable to any kind of sub-type identification.
Incremental ELMVIS for unsupervised learning
Akusok, Anton, Eirola, Emil, Miche, Yoan, Oliver, Ian, Björk, Kaj-Mikael, Gritsenko, Andrey, Baek, Stephen, Lendasse, Amaury
An incremental version of the ELMVIS+ method is proposed in this paper. It iteratively selects a few best fitting data samples from a large pool, and adds them to the model. The method keeps high speed of ELMVIS+ while allowing for much larger possible sample pools due to lower memory requirements. The extension is useful for reaching a better local optimum with greedy optimization of ELMVIS, and the data structure can be specified in semi-supervised optimization. The major new application of incremental ELMVIS is not to visualization, but to a general dataset processing. The method is capable of learning dependencies from non-organized unsupervised data -- either reconstructing a shuffled dataset, or learning dependencies in complex high-dimensional space. The results are interesting and promising, although there is space for improvements.
Generalized Residual Ratio Thresholding
Kallummil, Sreejith, Kalyani, Sheetal
Simultaneous orthogonal matching pursuit (SOMP) and block OMP (BOMP) are two widely used techniques for sparse support recovery in multiple measurement vector (MMV) and block sparse (BS) models respectively. For optimal performance, both SOMP and BOMP require \textit{a priori} knowledge of signal sparsity or noise variance. However, sparsity and noise variance are unavailable in most practical applications. This letter presents a novel technique called generalized residual ratio thresholding (GRRT) for operating SOMP and BOMP without the \textit{a priori} knowledge of signal sparsity and noise variance and derive finite sample and finite signal to noise ratio (SNR) guarantees for exact support recovery. Numerical simulations indicate that GRRT performs similar to BOMP and SOMP with \textit{a priori} knowledge of signal and noise statistics.
Comparison of Classification Methods for Very High-Dimensional Data in Sparse Random Projection Representation
Machine learning is a mature scientific field with lots of theoretical results, established algorithms and processes that address various supervised and unsupervised problems using the provided data. In theoretical research, such data is generated in a convenient way, or various methods are compared on standard benchmark problems - where data samples are represented as dense real-valued vectors of fixed and relatively low length. Practical applications represented by such standard datasets can successfully be solved by one of a myriad of existing machine learning methods and their implementations. However, the most impact of machine learning is currently in the big data field with the problems that are well explained in natural language ("Find malicious files", "Is that website safe to browse?") but are hard to encode numerically. Data samples in these problems have distinct features coming from a huge unordered set of possible features. Same approach can cover a frequent case of missing feature values [10, 28].
The Brier Score under Administrative Censoring: Problems and Solutions
Kvamme, Håvard, Borgan, Ørnulf
Box 1053 Blindern 0316 Oslo, Norway Abstract The Brier score is commonly used for evaluating probability predictions. In survival analysis, with right-censored observations of the event times, this score can be weighted by the inverse probability of censoring (IPCW) to retain its original interpretation. It is common practice to estimate the censoring distribution with the Kaplan-Meier estimator, even though it assumes that the censoring distribution is independent of the covariates. This paper discusses the general impact of the censoring estimates on the Brier score and shows that the estimation of the censoring distribution can be problematic. In particular, when the censoring times can be identified from the covariates, the IPCW score is no longer valid. For administratively censored data, where the potential censoring times are known for all individuals, we propose an alternative version of the Brier score. This administrative Brier score does not require estimation of the censoring distribution and is valid even if the censoring times can be identified from the covariates. Keywords: survival analysis, time-to-event-prediction, customer churn, inverse probability weighting, progressive type I censoring 1. Introduction Recently, there has been an increasing interest in combining machine learning methodology with survival analysis for improved time-to-event prediction. Also worth mentioning is the Random Survival Forest (Ishwaran et al., 2008) which makes decision trees based on the log-rank test and estimates the cumulative hazards with the Nelson-Aalen estimator. Although these methods are available for right-censored event times, a substantial part of the machine learning community is not familiar with survival analysis and might find it reasonable to instead apply binary classifiers for time-to-event prediction. In short, a binary classifier estimates the probability that an individual experience the event by time t, and can be fitted by disregarding individuals censored before that time. Arguably, the two most common evaluation criteria for survival predictions are the inverse probability of censoring weighted (IPCW) Brier score (Graf et al., 1999; Gerds and Schumacher, 2006) and different versions of the concordance index (Harrell Jr et al., 1982; Antolini et al., 2005; Uno et al., 2011; Gerds et al., 2013).
Analytic expressions for the output evolution of a deep neural network
Anastasia Borovykh December 19, 2019 Abstract We present a novel methodology based on a Taylor expansion of the network output for obtaining analytical expressions for the expected value of the network weights and output under stochastic training. Using these analytical expressions the effects of the hyperparameters and the noise variance of the optimization algorithm on the performance of the deep neural network are studied. In the early phases of training with a small noise coefficient, the output is equivalent to a linear model. In this case the network can generalize better due to the noise preventing the output from fully converging on the train data, however the noise does not result in any explicit regularization. In the later training stages, when higher order approximations are required, the impact of the noise becomes more significant, i.e. in a model which is nonlinear in the weights noise can regularize the output function resulting in better generalization as witnessed by its influence on the weight Hessian, a commonly used metric for generalization capabilities. Keywords: deep learning; Taylor expansion; stochastic gradient descent; regularization; generalization 1 Introduction With the large number of applications which are nowadays in some way using deep learning, it is of significant value to gain insight into the output evolution of a deep neural network and the effects that the model architecture and optimization algorithm have on it. A deep neural network is a complex model due to the nonlinear dependencies and the large number of parameters in the model. Understanding the network output and its generalization capabilities, i.e. how well a model optimized on train data will be able to perform on unseen test data, is thus a complex task. One way of gaining insight into the network is by studying it in a large-parameter limit, a setting in which its dynamics becomes analytically tractable. Such limits have been considered in e.g. The generalization capabilities and the definition of various quantities that measure these have been studied extensively. Previous work has shown that the norm [3], [27], [19], the width of a minimum in weight space [11], [34], the input sensitivity [28] and a model's compressibility [2] can be related (either theoretically or in practice) to the model's complexity and thus its ability to perform well on unseen data. Furthermore, it has been noted that the generalization capabilities can be influenced by the optimization algorithm used to train the model, e.g. it can be used to bias the model into configurations that are more robust to noise and have lower model complexity, see e.g. Furthermore, it has been observed that certain parameters of stochastic gradient descent (SGD) can be used to control the generalization error and the data fit, see e.g.
Distributional Reinforcement Learning for Energy-Based Sequential Models
Parshakova, Tetiana, Andreoli, Jean-Marc, Dymetman, Marc
Global Autoregressive Models (GAMs) are a recent proposal [Parshakova et al., CoNLL 2019] for exploiting global properties of sequences for data-efficient learning of seq2seq models. In the first phase of training, an Energy-Based model (EBM) over sequences is derived. This EBM has high representational power, but is unnormalized and cannot be directly exploited for sampling. To address this issue [Parshakova et al., CoNLL 2019] proposes a distillation technique, which can only be applied under limited conditions. By relating this problem to Policy Gradient techniques in RL, but in a \emph{distributional} rather than \emph{optimization} perspective, we propose a general approach applicable to any sequential EBM. Its effectiveness is illustrated on GAM-based experiments.