Data Cleansing for Models Trained with SGD

arXiv.org Machine Learning

Data cleansing is a common approach for improving the accuracy of machine learning models; however, it requires extensive domain knowledge to identify the influential instances that affect the model. In this paper, we propose an algorithm that suggests influential instances without using any domain knowledge. With the proposed method, users only need to inspect the instances suggested by the algorithm, so even non-experts can conduct data cleansing and improve the model. Existing methods require the loss function to be convex and an optimal model to be obtained, which is not always the case in modern machine learning. To overcome these limitations, we propose a novel approach designed specifically for models trained with stochastic gradient descent (SGD). The proposed method infers the influential instances by retracing the steps of SGD, using the intermediate models computed at each step. Through experiments, we demonstrate that the proposed method accurately infers the influential instances. Moreover, using MNIST and CIFAR10, we show that models can be effectively improved by removing the instances that the method identifies as influential.
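The retracing idea lends itself to a compact sketch. Below is a minimal, illustrative implementation for linear regression with squared loss, where per-instance gradients and Hessian-vector products have closed forms; the function names, toy data, and bookkeeping are our own assumptions, not the authors' code.

```python
# Hedged sketch of SGD-influence estimation by retracing stored SGD steps.
import numpy as np

def sgd_with_checkpoints(X, y, lr, batches, theta0):
    """Run SGD on squared loss, storing the parameters seen at every step."""
    theta, ckpts = theta0.copy(), [theta0.copy()]
    for idx in batches:                      # idx: indices of one mini-batch
        grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
        theta = theta - lr * grad
        ckpts.append(theta.copy())
    return theta, ckpts

def sgd_influence(X, y, lr, batches, ckpts, grad_val):
    """Approximate each instance's effect on a validation quantity whose
    gradient at the final parameters is grad_val, by retracing SGD."""
    infl = np.zeros(len(y))
    u = grad_val.copy()                      # u_T
    for idx, theta in zip(reversed(batches), reversed(ckpts[:-1])):
        for j in idx:                        # contributions of this step's batch
            g_j = X[j] * (X[j] @ theta - y[j])
            infl[j] += lr / len(idx) * (u @ g_j)
        # propagate backwards: u <- (I - lr * H_t) u, H_t = X_S^T X_S / |S|
        u = u - lr * (X[idx].T @ (X[idx] @ u) / len(idx))
    return infl  # most negative: removing that instance lowers val. loss most

rng = np.random.default_rng(0)
w_true = rng.standard_normal(5)
X, Xv = rng.standard_normal((100, 5)), rng.standard_normal((50, 5))
y, yv = X @ w_true, Xv @ w_true
y[0] += 10.0                                 # plant one harmful instance
batches = [rng.choice(100, size=10, replace=False) for _ in range(200)]
theta, ckpts = sgd_with_checkpoints(X, y, 0.05, batches, np.zeros(5))
grad_val = Xv.T @ (Xv @ theta - yv) / len(yv)
print(np.argmin(sgd_influence(X, y, 0.05, batches, ckpts, grad_val)))  # likely 0
```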


Bayesian Modelling in Practice: Using Uncertainty to Improve Trustworthiness in Medical Applications

arXiv.org Machine Learning

The Intensive Care Unit (ICU) is a hospital department where machine learning has the potential to provide valuable assistance in clinical decision making. Classical machine learning models usually provide only point estimates and no uncertainty for their predictions. In practice, uncertain predictions should be presented to doctors with extra care in order to prevent potentially catastrophic treatment decisions. In this work we show how Bayesian modelling and the predictive uncertainty that it provides can be used to mitigate the risk of misguided predictions and to detect out-of-domain examples in a medical setting. We analytically derive a bound on the prediction loss with respect to predictive uncertainty; the bound shows that uncertainty can mitigate loss. Furthermore, we apply a Bayesian Neural Network to the MIMIC-III dataset, predicting risk of mortality for ICU patients. Our empirical results show that uncertainty can indeed prevent potential errors and reliably identify out-of-domain patients. These results suggest that Bayesian predictive uncertainty can greatly improve the trustworthiness of machine learning models in high-risk settings such as the ICU.
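To make the workflow concrete: the sketch below flags uncertain predictions using Monte Carlo dropout, a common stand-in for Bayesian Neural Network posterior sampling. It is not the paper's exact model or thresholds; the network, feature dimensions, and cutoff are illustrative assumptions.

```python
# Hedged sketch: deferring uncertain predictions via MC-dropout sampling.
import torch
import torch.nn as nn

class MortalityNet(nn.Module):
    def __init__(self, d_in):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 64), nn.ReLU(), nn.Dropout(p=0.2),
            nn.Linear(64, 1),
        )
    def forward(self, x):
        return torch.sigmoid(self.net(x))

def predict_with_uncertainty(model, x, n_samples=50):
    model.train()  # keep dropout active to sample from the approx. posterior
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(0), samples.std(0)   # predictive mean and uncertainty

model = MortalityNet(d_in=32)
x = torch.randn(8, 32)                       # stand-in for patient features
mean, std = predict_with_uncertainty(model, x)
defer = std.squeeze() > 0.15                 # illustrative threshold: defer
print(mean.squeeze(), defer)                 # uncertain cases to a clinician
```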


Inverting Deep Generative Models, One Layer at a Time

arXiv.org Machine Learning

We study the problem of inverting a deep generative model with ReLU activations. Inversion corresponds to finding a latent code vector that explains the observed measurements as well as possible. In most prior work this is performed by attempting to solve a non-convex optimization problem involving the generator. In this paper we obtain several novel theoretical results for the inversion problem. We show that for the realizable case, single-layer inversion can be performed exactly in polynomial time by solving a linear program. Further, we show that for multiple layers, inversion is NP-hard and the pre-image set can be non-convex. For generative models of arbitrary depth, we show that exact recovery is possible in polynomial time with high probability if the layers are expanding and the weights are randomly selected. Very recent work analyzed the same problem for gradient descent inversion; their analysis requires significantly higher expansion (logarithmic in the latent dimension), whereas our proposed algorithm can provably reconstruct even with constant-factor expansion. We also provide provable error bounds in different norms for reconstructing noisy observations. Our empirical validation demonstrates that we obtain better reconstructions when the latent dimension is large.
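The single-layer result is easy to state as code: in the realizable case, the active units of $y = \mathrm{relu}(Wz + b)$ give equality constraints and the inactive units give inequality constraints, so recovery is a linear (feasibility) program. The sketch below follows that formulation; the variable names and the use of scipy are our own choices, not the paper's implementation.

```python
# Hedged sketch: exact single-layer ReLU inversion via a linear program.
import numpy as np
from scipy.optimize import linprog

def invert_relu_layer(W, b, y):
    """Recover z with relu(W @ z + b) == y, assuming such a z exists."""
    on, off = y > 0, y == 0
    res = linprog(
        c=np.zeros(W.shape[1]),              # feasibility: zero objective
        A_eq=W[on], b_eq=(y - b)[on],        # active units:   W z + b = y
        A_ub=W[off], b_ub=(-b)[off],         # inactive units: W z + b <= 0
        bounds=(None, None),                 # z is unconstrained
    )
    return res.x if res.success else None

rng = np.random.default_rng(0)
W, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
z_true = rng.standard_normal(5)
y = np.maximum(W @ z_true + b, 0.0)
z_hat = invert_relu_layer(W, b, y)
print(np.allclose(np.maximum(W @ z_hat + b, 0.0), y))   # exact inversion
```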


On The Radon-Nikodym Spectral Approach With Optimal Clustering

arXiv.org Machine Learning

Problems of interpolation, classification, and clustering are considered. In the tenets of the Radon-Nikodym approach $\langle f(\mathbf{x})\psi^2 \rangle / \langle\psi^2\rangle$, where $\psi(\mathbf{x})$ is a linear function of the input attributes, all the answers are obtained from a generalized eigenproblem $|f|\psi^{[i]}\rangle = \lambda^{[i]} |\psi^{[i]}\rangle$. The solution to the interpolation problem is a regular Radon-Nikodym derivative. The solution to the classification problem requires prior and posterior probabilities that are obtained using the Lebesgue quadrature [1] technique. Whereas in a Bayesian approach new observations change only the outcome probabilities, in the Radon-Nikodym approach not only the outcome probabilities but also the probability space $|\psi^{[i]}\rangle$ changes with new observations. This is a remarkable feature of the approach: both the probabilities and the probability space are constructed from the data. The Lebesgue quadrature technique can also be applied to the optimal clustering problem, which is solved by constructing a Gaussian quadrature on the Lebesgue measure. A distinguishing feature of the Radon-Nikodym approach is the knowledge of the invariant group: all the answers are invariant relative to any non-degenerate linear transform of the input vector $\mathbf{x}$ components. A software product implementing the algorithms for interpolation, classification, and optimal clustering is available from the authors.
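In matrix form, a linear $\psi(\mathbf{x}) = \mathbf{c} \cdot \mathbf{x}$ turns the moments $\langle f\psi^2\rangle$ and $\langle\psi^2\rangle$ into matrices $F$ and $G$, and the eigenproblem becomes $F\mathbf{c} = \lambda G\mathbf{c}$. The sketch below solves it and applies the spectral interpolation form; the polynomial attribute basis and implementation details are our own assumptions, not the authors' software.

```python
# Hedged sketch of the generalized eigenproblem behind the approach.
import numpy as np
from scipy.linalg import eigh

def rn_fit(X, f):
    """X: (n, d) attribute vectors; f: (n,) observed values."""
    G = X.T @ X                        # <x_j x_k> moments,   i.e. <psi^2>
    F = X.T @ (f[:, None] * X)         # <f x_j x_k> moments, i.e. <f psi^2>
    return eigh(F, G)                  # generalized eigenproblem F c = lam G c

def rn_interpolate(Xq, lam, C):
    """Radon-Nikodym interpolation: sum_i lam_i psi_i(x)^2 / sum_i psi_i(x)^2."""
    P = (Xq @ C) ** 2                  # psi_i(x)^2 at each query point
    return (P @ lam) / P.sum(axis=1)

x = np.linspace(-1, 1, 200)
X = x[:, None] ** np.arange(6)         # polynomial attributes as the basis
f = np.sin(3 * x)
lam, C = rn_fit(X, f)
print(rn_interpolate(X[:3], lam, C), f[:3])   # rough agreement expected
```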


Wasserstein Reinforcement Learning

arXiv.org Machine Learning

We propose behavior-driven optimization via Wasserstein distances (WDs) to improve several classes of state-of-the-art reinforcement learning (RL) algorithms. We show that WD regularizers acting on appropriate policy embeddings efficiently incorporate behavioral characteristics into policy optimization. We demonstrate that they improve Evolution Strategy methods by encouraging more efficient exploration, can be applied in imitation learning, and can speed up training of Trust Region Policy Optimization methods. Since the exact computation of WDs is expensive, we develop approximate algorithms that combine the dual formulation of the optimal transport problem, alternating optimization, and random feature maps to replace exact WD computations in the RL tasks considered. We provide theoretical analysis of our algorithms and exhaustive empirical evaluation in a variety of RL settings.
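To illustrate the regularizer idea, the sketch below approximates a WD between two sets of behavior embeddings with entropy-regularized Sinkhorn iterations. Note the swap: the paper combines a dual OT formulation, alternating optimization, and random feature maps; Sinkhorn is a simpler, widely used approximation used here purely for illustration.

```python
# Hedged sketch: Sinkhorn approximation of a behavioral Wasserstein penalty.
import numpy as np

def sinkhorn_wd(X, Y, eps=0.5, n_iter=200):
    """Entropy-regularized OT cost between empirical measures on X and Y."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # pairwise costs
    K = np.exp(-C / eps)
    a, b = np.full(len(X), 1 / len(X)), np.full(len(Y), 1 / len(Y))
    u = a.copy()
    for _ in range(n_iter):                  # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # approximate transport plan
    return float((P * C).sum())

rng = np.random.default_rng(0)
emb_ref = rng.normal(0.0, 1.0, (64, 8))      # reference policy's embeddings
emb_new = rng.normal(0.5, 1.0, (64, 8))      # candidate policy's embeddings
print(sinkhorn_wd(emb_ref, emb_new))         # usable as a behavioral penalty
```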


The Broad Optimality of Profile Maximum Likelihood

arXiv.org Machine Learning

We study three fundamental statistical-learning problems: distribution estimation, property estimation, and property testing. We establish the profile maximum likelihood (PML) estimator as the first unified sample-optimal approach to a wide range of learning tasks. In particular, for every alphabet size $k$ and desired accuracy $\varepsilon$:

$\textbf{Distribution estimation}$: under $\ell_1$ distance, PML yields optimal $\Theta(k/(\varepsilon^2\log k))$ sample complexity for sorted-distribution estimation, and a PML-based estimator empirically outperforms the Good-Turing estimator on the actual distribution.

$\textbf{Additive property estimation}$: for a broad class of additive properties, the PML plug-in estimator uses just four times the sample size required by the best estimator to achieve roughly twice its error, with exponentially higher confidence.

$\boldsymbol{\alpha}\textbf{-R\'enyi entropy estimation}$: for integer $\alpha>1$, the PML plug-in estimator has optimal $k^{1-1/\alpha}$ sample complexity; for non-integer $\alpha>3/4$, its sample complexity is lower than the state of the art.

$\textbf{Identity testing}$: in testing whether an unknown distribution is equal to, or at least $\varepsilon$-far from, a given distribution in $\ell_1$ distance, a PML-based tester achieves the optimal sample complexity up to logarithmic factors of $k$.

With minor modifications, most of these results also hold for a near-linear-time computable variant of PML.
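For readers new to PML, the "profile" of a sample is the multiset of symbol multiplicities (how many symbols appeared once, twice, and so on); PML picks the distribution maximizing the probability of that profile. The optimization itself is hard and not attempted below; the sketch only computes a profile and a naive plug-in entropy estimate for contrast, with illustrative names of our own.

```python
# Hedged sketch: sample profiles and a naive plug-in property estimate.
from collections import Counter
import math

def profile(sample):
    """Map multiplicity -> number of distinct symbols seen that many times."""
    mults = Counter(sample).values()
    return dict(sorted(Counter(mults).items()))

def empirical_entropy(sample):
    """Naive plug-in entropy estimate, for contrast with PML-based plug-ins."""
    n = len(sample)
    return -sum(c / n * math.log(c / n) for c in Counter(sample).values())

s = list("abracadabra")
print(profile(s))        # {1: 2, 2: 2, 5: 1}: c,d once; b,r twice; a five times
print(empirical_entropy(s))
```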


Understanding Generalization through Visualizations

arXiv.org Machine Learning

The power of neural networks lies in their ability to generalize to unseen data, yet the underlying reasons for this phenomenon remain elusive. Numerous rigorous attempts have been made to explain generalization, but available bounds are still quite loose, and analysis does not always lead to true understanding. The goal of this work is to make generalization more intuitive. Using visualization methods, we discuss the mystery of generalization, the geometry of loss landscapes, and how the curse (or, rather, the blessing) of dimensionality causes optimizers to settle into minima that generalize well.
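The kind of picture the discussion relies on can be reproduced in miniature by plotting the loss along a line in parameter space. The sketch below does this for a toy linear model along a scale-matched random direction; the model, data, and scaling choice are our own illustrative assumptions, not the authors' setup.

```python
# Hedged sketch: 1-D slice of a loss landscape along a random direction.
import numpy as np
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                         # toy stand-in model
X, y = torch.randn(256, 10), torch.randn(256, 1)
loss_fn = nn.MSELoss()

theta = torch.cat([p.detach().flatten() for p in model.parameters()])
direction = torch.randn_like(theta)
direction *= theta.norm() / direction.norm()     # match the parameter scale

def loss_at(alpha):
    """Loss at parameters theta + alpha * direction."""
    flat = theta + alpha * direction
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(flat[offset:offset + n].view_as(p))
            offset += n
        return loss_fn(model(X), y).item()

for a in np.linspace(-1.0, 1.0, 5):
    print(f"alpha={a:+.2f}  loss={loss_at(a):.4f}")
```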


Provable Gradient Variance Guarantees for Black-Box Variational Inference

arXiv.org Machine Learning

Recent variational inference methods use stochastic gradient estimators whose variance is not well understood. Theoretical guarantees on this variance are important for understanding when these methods will or will not work. This paper gives bounds for the common "reparameterization" estimators when the target is smooth and the variational family is a location-scale distribution. These bounds are unimprovable and thus provide the best possible guarantees under the stated assumptions.
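For concreteness, the estimator the bounds concern looks as follows: for a location-scale family, a sample is written as $z = \mu + \sigma\epsilon$ and the ELBO gradient is obtained by differentiating through that transform. The target, dimensions, and sample counts below are toy choices of ours.

```python
# Hedged sketch: the reparameterization gradient estimator and its variance.
import torch

def log_p(z):                                  # smooth target: unnorm. Gaussian
    return -0.5 * (z ** 2).sum()

loc = torch.zeros(2, requires_grad=True)
log_scale = torch.zeros(2, requires_grad=True)

def elbo_grad_sample():
    eps = torch.randn(2)                       # base-distribution sample
    z = loc + log_scale.exp() * eps            # location-scale reparameterization
    entropy = log_scale.sum()                  # Gaussian entropy up to a constant
    loss = -(log_p(z) + entropy)               # negative single-sample ELBO
    g = torch.autograd.grad(loss, (loc, log_scale))
    return torch.cat([gi.detach() for gi in g])

grads = torch.stack([elbo_grad_sample() for _ in range(1000)])
print("gradient variance per coordinate:", grads.var(dim=0))
```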


Evaluating Protein Transfer Learning with TAPE

arXiv.org Machine Learning

Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems.


Local Bures-Wasserstein Transport: A Practical and Fast Mapping Approximation

arXiv.org Machine Learning

Optimal transport (OT)-based methods have a wide range of applications and have attracted a tremendous amount of attention in recent years. However, most of the computational approaches to OT do not learn the underlying transport map. Although some algorithms have been proposed to learn this map, they rely on kernel-based methods, which makes them prohibitively slow as the number of samples increases. Here, we propose a way to learn an approximate transport map and a parametric approximation of the Wasserstein barycenter. We build an approximate transport mapping by leveraging the closed form of Gaussian (Bures-Wasserstein) transport; we compute local transport plans between matched pairs of the Gaussian components of each density. The learned map generalizes to out-of-sample examples. We provide experimental results on simulated and real data, comparing our proposed method with other mapping-estimation algorithms. Preliminary experiments suggest that our proposed method is not only faster, by roughly a factor of 80 in overall running time, but also requires fewer components than state-of-the-art methods to recover the support of the barycenter. From a practical standpoint, it is straightforward to implement and can be used within a conventional machine learning pipeline.
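The closed-form building block the method leverages is standard: between $N(m_1, \Sigma_1)$ and $N(m_2, \Sigma_2)$, the OT map is $T(x) = m_2 + A(x - m_1)$ with $A = \Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}$. The sketch below implements only this Gaussian map; the paper's local matching and mixture machinery is not reproduced, and the toy parameters are our own.

```python
# Hedged sketch: closed-form Bures-Wasserstein transport between Gaussians.
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein_map(m1, S1, m2, S2):
    """Closed-form OT map between N(m1, S1) and N(m2, S2)."""
    r = np.real(sqrtm(S1))
    r_inv = np.linalg.inv(r)
    A = r_inv @ np.real(sqrtm(r @ S2 @ r)) @ r_inv
    return lambda x: m2 + (x - m1) @ A.T

rng = np.random.default_rng(0)
m1, m2 = np.zeros(2), np.array([3.0, -1.0])
S1 = np.eye(2)
S2 = np.array([[2.0, 0.5], [0.5, 1.0]])
T = bures_wasserstein_map(m1, S1, m2, S2)
Z = T(rng.multivariate_normal(m1, S1, size=2000))
print(Z.mean(axis=0), np.cov(Z.T))   # should approach m2 and S2
```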