Collaborating Authors Machine Learning

Refined Convergence Rates for Maximum Likelihood Estimation under Finite Mixture Models Machine Learning

We revisit convergence rates for maximum likelihood estimation (MLE) under finite mixture models. The Wasserstein distance has become a standard loss function for the analysis of parameter estimation in these models, due in part to its ability to circumvent label switching and to accurately characterize the behaviour of fitted mixture components with vanishing weights. However, the Wasserstein metric is only able to capture the worst-case convergence rate among the remaining fitted mixture components. We demonstrate that when the log-likelihood function is penalized to discourage vanishing mixing weights, stronger loss functions can be derived to resolve this shortcoming of the Wasserstein distance. These new loss functions accurately capture the heterogeneity in convergence rates of fitted mixture components, and we use them to sharpen existing pointwise and uniform convergence rates in various classes of mixture models. In particular, these results imply that a subset of the components of the penalized MLE typically converge significantly faster than could have been anticipated from past work. We further show that some of these conclusions extend to the traditional MLE. Our theoretical findings are supported by a simulation study to illustrate these improved convergence rates.

Optimal sizing of a holdout set for safe predictive model updating Machine Learning

Risk models in medical statistics and healthcare machine learning are increasingly used to guide clinical or other interventions. Should a model be updated after a guided intervention, it may lead to its own failure at making accurate predictions. The use of a `holdout set' -- a subset of the population that does not receive interventions guided by the model -- has been proposed to prevent this. Since patients in the holdout set do not benefit from risk predictions, the chosen size must trade off maximising model performance whilst minimising the number of held out patients. By defining a general loss function, we prove the existence and uniqueness of an optimal holdout set size, and introduce parametric and semi-parametric algorithms for its estimation. We demonstrate their use on a recent risk score for pre-eclampsia. Based on these results, we argue that a holdout set is a safe, viable and easily implemented solution to the model update problem.

Global Convergence of Sub-gradient Method for Robust Matrix Recovery: Small Initialization, Noisy Measurements, and Over-parameterization Machine Learning

In this work, we study the performance of sub-gradient method (SubGM) on a natural nonconvex and nonsmooth formulation of low-rank matrix recovery with $\ell_1$-loss, where the goal is to recover a low-rank matrix from a limited number of measurements, a subset of which may be grossly corrupted with noise. We study a scenario where the rank of the true solution is unknown and over-estimated instead. The over-estimation of the rank gives rise to an over-parameterized model in which there are more degrees of freedom than needed. Such over-parameterization may lead to overfitting, or adversely affect the performance of the algorithm. We prove that a simple SubGM with small initialization is agnostic to both over-parameterization and noise in the measurements. In particular, we show that small initialization nullifies the effect of over-parameterization on the performance of SubGM, leading to an exponential improvement in its convergence rate. Moreover, we provide the first unifying framework for analyzing the behavior of SubGM under both outlier and Gaussian noise models, showing that SubGM converges to the true solution, even under arbitrarily large and arbitrarily dense noise values, and--perhaps surprisingly--even if the globally optimal solutions do not correspond to the ground truth. At the core of our results is a robust variant of restricted isometry property, called Sign-RIP, which controls the deviation of the sub-differential of the $\ell_1$-loss from that of an ideal, expected loss. As a byproduct of our results, we consider a subclass of robust low-rank matrix recovery with Gaussian measurements, and show that the number of required samples to guarantee the global convergence of SubGM is independent of the over-parameterized rank.

Information Theory with Kernel Methods Machine Learning

We consider the analysis of probability distributions through their associated covariance operators from reproducing kernel Hilbert spaces. We show that the von Neumann entropy and relative entropy of these operators are intimately related to the usual notions of Shannon entropy and relative entropy, and share many of their properties. They come together with efficient estimation algorithms from various oracles on the probability distributions. We also consider product spaces and show that for tensor product kernels, we can define notions of mutual information and joint entropies, which can then characterize independence perfectly, but only partially conditional independence. We finally show how these new notions of relative entropy lead to new upper-bounds on log partition functions, that can be used together with convex optimization within variational inference methods, providing a new family of probabilistic inference methods.

Efficient and Reliable Probabilistic Interactive Learning with Structured Outputs Machine Learning

In this position paper, we study interactive learning for structured output spaces, with a focus on active learning, in which labels are unknown and must be acquired, and on skeptical learning, in which the labels are noisy and may need relabeling. These scenarios require expressive models that guarantee reliable and efficient computation of probabilistic quantities to measure uncertainty. We identify conditions under which a class of probabilistic models -- which we denote CRISPs -- meet all of these conditions, thus delivering tractable computation of the above quantities while preserving expressiveness. Building on prior work on tractable probabilistic circuits, we illustrate how CRISPs enable robust and efficient active and skeptical learning in large structured output spaces.

A hypothesis-driven method based on machine learning for neuroimaging data analysis Machine Learning

There remains an open question about the usefulness and the interpretation of Machine learning (MLE) approaches for discrimination of spatial patterns of brain images between samples or activation states. In the last few decades, these approaches have limited their operation to feature extraction and linear classification tasks for between-group inference. In this context, statistical inference is assessed by randomly permuting image labels or by the use of random effect models that consider between-subject variability. These multivariate MLE-based statistical pipelines, whilst potentially more effective for detecting activations than hypotheses-driven methods, have lost their mathematical elegance, ease of interpretation, and spatial localization of the ubiquitous General linear Model (GLM). Recently, the estimation of the conventional GLM has been demonstrated to be connected to an univariate classification task when the design matrix is expressed as a binary indicator matrix. In this paper we explore the complete connection between the univariate GLM and MLE \emph{regressions}. To this purpose we derive a refined statistical test with the GLM based on the parameters obtained by a linear Support Vector Regression (SVR) in the \emph{inverse} problem (SVR-iGLM). Subsequently, random field theory (RFT) is employed for assessing statistical significance following a conventional GLM benchmark. Experimental results demonstrate how parameter estimations derived from each model (mainly GLM and SVR) result in different experimental design estimates that are significantly related to the predefined functional task. Moreover, using real data from a multisite initiative the proposed MLE-based inference demonstrates statistical power and the control of false positives, outperforming the regular GLM.

Fast Interpretable Greedy-Tree Sums (FIGS) Machine Learning

Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in many problems. Here, we propose Fast Interpretable Greedy-Tree Sums (FIGS), an algorithm for fitting concise rule-based models. Specifically, FIGS generalizes the CART algorithm to simultaneously grow a flexible number of trees in a summation. The total number of splits across all the trees can be restricted by a pre-specified threshold, thereby keeping both the size and number of its trees under control. When both are small, the fitted tree-sum can be easily visualized and written out by hand, making it highly interpretable. A partially oracle theoretical result hints at the potential for FIGS to overcome a key weakness of single-tree models by disentangling additive components of generative additive models, thereby reducing redundancy from repeated splits on the same feature. Furthermore, given oracle access to optimal tree structures, we obtain L2 generalization bounds for such generative models in the case of C1 component functions, matching known minimax rates in some cases. Extensive experiments across a wide array of real-world datasets show that FIGS achieves state-of-the-art prediction performance (among all popular rule-based methods) when restricted to just a few splits (e.g. less than 20). We find empirically that FIGS is able to avoid repeated splits, and often provides more concise decision rules than fitted decision trees, without sacrificing predictive performance. All code and models are released in a full-fledged package on Github \url{}.

General Cyclical Training of Neural Networks Machine Learning

This paper describes the principle of "General Cyclical Training" in machine learning, where training starts and ends with "easy training" and the "hard training" happens during the middle epochs. We propose several manifestations for training neural networks, including algorithmic examples (via hyper-parameters and loss functions), data-based examples, and model-based examples. Specifically, we introduce several novel techniques: cyclical weight decay, cyclical batch size, cyclical focal loss, cyclical softmax temperature, cyclical data augmentation, cyclical gradient clipping, and cyclical semi-supervised learning. In addition, we demonstrate that cyclical weight decay, cyclical softmax temperature, and cyclical gradient clipping (as three examples of this principle) are beneficial in the test accuracy performance of a trained model. Furthermore, we discuss model-based examples (such as pretraining and knowledge distillation) from the perspective of general cyclical training and recommend some changes to the typical training methodology. In summary, this paper defines the general cyclical training concept and discusses several specific ways in which this concept can be applied to training neural networks. In the spirit of reproducibility, the code used in our experiments is available at \url{}.

DeepHybrid: Deep Learning on Automotive Radar Spectra and Reflections for Object Classification Machine Learning

Automated vehicles need to detect and classify objects and traffic participants accurately. Reliable object classification using automotive radar sensors has proved to be challenging. We propose a method that combines classical radar signal processing and Deep Learning algorithms. The range-azimuth information on the radar reflection level is used to extract a sparse region of interest from the range-Doppler spectrum. This is used as input to a neural network (NN) that classifies different types of stationary and moving objects. We present a hybrid model (DeepHybrid) that receives both radar spectra and reflection attributes as inputs, e.g. radar cross-section. Experiments show that this improves the classification performance compared to models using only spectra. Moreover, a neural architecture search (NAS) algorithm is applied to find a resource-efficient and high-performing NN. NAS yields an almost one order of magnitude smaller NN than the manually-designed one while preserving the accuracy. The proposed method can be used for example to improve automatic emergency braking or collision avoidance systems.

Hybridizing Physical and Data-driven Prediction Methods for Physicochemical Properties Machine Learning

We present a generic way to hybridize physical and data-driven methods for predicting physicochemical properties. The approach `distills' the physical method's predictions into a prior model and combines it with sparse experimental data using Bayesian inference. We apply the new approach to predict activity coefficients at infinite dilution and obtain significant improvements compared to the data-driven and physical baselines and established ensemble methods from the machine learning literature.