Collaborating Authors

 Walk, Harro


Lossless Transformations and Excess Risk Bounds in Statistical Inference

arXiv.org Machine Learning

We study the excess minimum risk in statistical inference, defined as the difference between the minimum expected loss in estimating a random variable from an observed feature vector and the minimum expected loss in estimating the same random variable from a transformation (statistic) of the feature vector. After characterizing lossless transformations, i.e., transformations for which the excess risk is zero for all loss functions, we construct a partitioning test statistic for the hypothesis that a given transformation is lossless and show that for i.i.d. data the test is strongly consistent. More generally, we develop information-theoretic upper bounds on the excess risk that hold uniformly over fairly general classes of loss functions. Based on these bounds, we introduce the notion of a delta-lossless transformation and give sufficient conditions for a given transformation to be universally delta-lossless. Applications to classification, nonparametric regression, portfolio strategies, information bottleneck, and deep learning are also surveyed.
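As a toy illustration of the lossless-transformation idea (a hypothetical example, not taken from the paper): if the label depends on the feature vector only through one coordinate, then projecting onto that coordinate is a lossless statistic, and the excess minimum risk under 0-1 loss is zero. The sketch below checks this numerically with plug-in estimates of the Bayes risk.

```python
import numpy as np

# Hypothetical discrete example: Y in {0,1}, feature X = (X1, X2) with X2
# pure noise, so the statistic T(x1, x2) = x1 is lossless. Under 0-1 loss
# the minimum expected loss is E[min(P(Y=1|X), P(Y=0|X))].

rng = np.random.default_rng(0)
n = 200_000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)  # independent noise coordinate
y = (rng.random(n) < np.where(x1 == 1, 0.8, 0.2)).astype(int)

def bayes_risk(features, labels):
    """Empirical plug-in estimate of the minimum 0-1 risk."""
    risk = 0.0
    for key in {tuple(row) for row in features}:
        mask = np.all(features == key, axis=1)
        p1 = labels[mask].mean()
        risk += mask.mean() * min(p1, 1 - p1)
    return risk

full = bayes_risk(np.column_stack([x1, x2]), y)      # risk from X
reduced = bayes_risk(x1.reshape(-1, 1), y)           # risk from T(X) = X1
print(full, reduced)  # both ≈ 0.2, so the excess risk is ≈ 0
```

Dropping the informative coordinate instead (keeping only `x2`) would drive the reduced-feature risk up to about 0.5, a strictly lossy transformation.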


Repeated Observations for Classification

arXiv.org Artificial Intelligence

We study the problem of nonparametric classification with repeated observations. Let $\mathbf{X}$ be the $d$-dimensional feature vector and let $Y$ denote the label taking values in $\{1,\dots,M\}$. In contrast to the usual setup with large sample size $n$ and relatively low dimension $d$, this paper deals with the situation where, instead of observing a single feature vector $\mathbf{X}$, we are given $t$ repeated feature vectors $\mathbf{V}_1,\dots,\mathbf{V}_t$. Some simple classification rules are presented such that the conditional error probabilities have an exponential rate of convergence as $t\to\infty$. In the analysis, we investigate particular models such as robust detection with nominal densities, prototype classification, linear transformations, linear classification, and scaling.
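The exponential decay in $t$ can be illustrated with a minimal sketch (not one of the paper's rules): for a binary label observed through $t$ independent noisy copies, each flipped with probability $p < 1/2$, a majority vote errs with probability $P(\mathrm{Bin}(t,p) \ge t/2) \le e^{-2t(1/2-p)^2}$ by Hoeffding's inequality.

```python
import numpy as np

# Hypothetical setup: each repeated observation V_i equals the true binary
# label Y flipped independently with probability p < 1/2. Majority voting
# over the t copies has conditional error decaying exponentially in t.

rng = np.random.default_rng(1)
p = 0.3
errs = []
for t in (1, 5, 25):
    trials = 100_000
    flips = rng.random((trials, t)) < p      # which copies are corrupted
    votes = (~flips).sum(axis=1)             # copies agreeing with the truth
    err = (votes <= t / 2).mean()            # majority-vote error (ties lose)
    bound = np.exp(-2 * t * (0.5 - p) ** 2)  # Hoeffding upper bound
    errs.append(err)
    print(t, err, bound)
```

The simulated error drops from about 0.30 at $t=1$ to under 0.02 at $t=25$, consistent with the exponential bound.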


Strongly universally consistent nonparametric regression and classification with privatised data

arXiv.org Machine Learning

In recent years there has been a surge of interest in data analysis methodology that is able to achieve strong statistical performance without compromising the privacy and security of individual data holders. This has often been driven by applications in modern technology, for example by Google (Erlingsson et al., 2014), Apple (Tang et al., 2017), and Microsoft (Ding et al., 2017), but the study goes at least as far back as Warner (1965), and such methods are often used in the more traditional fields of clinical trials (Vu and Slavkovic, 2009; Dankar and El Emam, 2013) and census data (Machanavajjhala et al., 2008; Dwork, 2019). While there has long been an awareness that sensitive data must be anonymised, it has become apparent only relatively recently that simply removing names and addresses is insufficient in many cases (e.g.
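Warner's (1965) randomised response, cited above as the earliest of these local-privacy mechanisms, is simple enough to sketch: each respondent reports their true sensitive bit with probability $p > 1/2$ and its flip otherwise, so no individual report is trustworthy, yet the population proportion is recoverable by inverting the known noise. The numbers below are illustrative, not from the survey.

```python
import numpy as np

# Warner's randomised response: privatise a sensitive binary attribute by
# flipping each respondent's answer with probability 1 - p, then debias.

rng = np.random.default_rng(2)
true_bits = rng.random(100_000) < 0.35   # hypothetical sensitive attribute
p = 0.75                                 # probability of a truthful report
truthful = rng.random(true_bits.size) < p
reported = np.where(truthful, true_bits, ~true_bits)

# E[reported] = p*pi + (1-p)*(1-pi), so pi = (mean - (1-p)) / (2p - 1).
pi_hat = (reported.mean() - (1 - p)) / (2 * p - 1)
print(pi_hat)  # ≈ 0.35, recovered without trusting any single report
```

The price of privacy is variance: the debiasing factor $1/(2p-1)$ inflates the estimator's standard error, which is the trade-off the consistency results for privatised regression and classification have to contend with.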