Performance Analysis
Sparse Fisher's Linear Discriminant Analysis for Partially Labeled Data
Classification is an important tool with many useful applications. Among the many classification methods, Fisher's Linear Discriminant Analysis (LDA) is a traditional model-based approach which makes use of the covariance information. However, in the high-dimensional, low-sample size setting, LDA cannot be directly deployed because the sample covariance is not invertible. While there are modern methods designed to deal with high-dimensional data, they may not fully use the covariance information as LDA does. Hence in some situations, it is still desirable to use a model-based method such as LDA for classification. This article exploits the potential of LDA in more complicated data settings. In many real applications, it is costly to manually place labels on observations; hence it is often that only a small portion of labeled data is available while a large number of observations are left without a label. It is a great challenge to obtain good classification performance through the labeled data alone, especially when the dimension is greater than the size of the labeled data. In order to overcome this issue, we propose a semi-supervised sparse LDA classifier to take advantage of the seemingly useless unlabeled data. They provide additional information which helps to boost the classification performance in some situations. A direct estimation method is used to reconstruct LDA and achieve the sparsity; meanwhile we employ the difference-convex algorithm to handle the non-convex loss function associated with the unlabeled data. Theoretical properties of the proposed classifier are studied. Our simulated examples help to understand when and how the information extracted from the unlabeled data can be useful. A real data example further illustrates the usefulness of the proposed method.
A Practioner's Guide to Evaluating Entity Resolution Results
Entity resolution (ER) is the task of identifying records belonging to the same entity (e.g. individual, group) across one or multiple databases. Ironically, it has multiple names: deduplication and record linkage, among others. In this paper we survey metrics used to evaluate ER results in order to iteratively improve performance and guarantee sufficient quality prior to deployment. Some of these metrics are borrowed from multi-class classification and clustering domains, though some key differences exist differentiating entity resolution from general clustering. Menestrina et al. empirically showed rankings from these metrics often conflict with each other, thus our primary motivation for studying them. This paper provides practitioners the basic knowledge to begin evaluating their entity resolution results.
Markov Boundary Discovery with Ridge Regularized Linear Models
Strobl, Eric V., Visweswaran, Shyam
Ridge regularized linear models (RRLMs), such as ridge regression and the SVM, are a popular group of methods that are used in conjunction with coefficient hypothesis testing to discover explanatory variables with a significant multivariate association to a response. However, many investigators are reluctant to draw causal interpretations of the selected variables due to the incomplete knowledge of the capabilities of RRLMs in causal inference. Under reasonable assumptions, we show that a modified form of RRLMs can get very close to identifying a subset of the Markov boundary by providing a worst-case bound on the space of possible solutions. The results hold for any convex loss, even when the underlying functional relationship is nonlinear, and the solution is not unique. Our approach combines ideas in Markov boundary and sufficient dimension reduction theory. Experimental results show that the modified RRLMs are competitive against state-of-the-art algorithms in discovering part of the Markov boundary from gene expression data.
A More Powerful Two-Sample Test in High Dimensions using Random Projection
Lopes, Miles E., Jacob, Laurent J., Wainwright, Martin J.
We consider the hypothesis testing problem of detecting a shift between the means of two multivariate normal distributions in the high-dimensional setting, allowing for the data dimension p to exceed the sample size n. Specifically, we propose a new test statistic for the two-sample test of means that integrates a random projection with the classical Hotelling T^2 statistic. Working under a high-dimensional framework with (p,n) tending to infinity, we first derive an asymptotic power function for our test, and then provide sufficient conditions for it to achieve greater power than other state-of-the-art tests. Using ROC curves generated from synthetic data, we demonstrate superior performance against competing tests in the parameter regimes anticipated by our theoretical results. Lastly, we illustrate an advantage of our procedure's false positive rate with comparisons on high-dimensional gene expression data involving the discrimination of different types of cancer.
Supervised Collective Classification for Crowdsourcing
Chen, Pin-Yu, Lien, Chia-Wei, Chu, Fu-Jen, Ting, Pai-Shun, Cheng, Shin-Ming
Crowdsourcing utilizes the wisdom of crowds for collective classification via information (e.g., labels of an item) provided by labelers. Current crowdsourcing algorithms are mainly unsupervised methods that are unaware of the quality of crowdsourced data. In this paper, we propose a supervised collective classification algorithm that aims to identify reliable labelers from the training data (e.g., items with known labels). The reliability (i.e., weighting factor) of each labeler is determined via a saddle point algorithm. The results on several crowdsourced data show that supervised methods can achieve better classification accuracy than unsupervised methods, and our proposed method outperforms other algorithms.
On Graphical Models via Univariate Exponential Family Distributions
Yang, Eunho, Ravikumar, Pradeep, Allen, Genevera I., Liu, Zhandong
Undirected graphical models, or Markov networks, are a popular class of statistical models, used in a wide variety of applications. Popular instances of this class include Gaussian graphical models and Ising models. In many settings, however, it might not be clear which subclass of graphical models to use, particularly for non-Gaussian and non-categorical data. In this paper, we consider a general sub-class of graphical models where the node-wise conditional distributions arise from exponential families. This allows us to derive multivariate graphical model distributions from univariate exponential family distributions, such as the Poisson, negative binomial, and exponential distributions. Our key contributions include a class of M-estimators to fit these graphical model distributions; and rigorous statistical analysis showing that these M-estimators recover the true graphical model structure exactly, with high probability. We provide examples of genomic and proteomic networks learned via instances of our class of graphical models derived from Poisson and exponential distributions.
Predicting SLA Violations in Real Time using Online Machine Learning
Ahmed, Jawwad, Johnsson, Andreas, Yanggratoke, Rerngvit, Ardelius, John, Flinta, Christofer, Stadler, Rolf
Next generation telecom services will execute on the telecom cloud, which combine the flexibility of today's computing clouds with the service quality of telecom systems. Real-time service assurance will become an integral part in transforming the general and flexible cloud into a robust and highly available cloud that can ensure low latency and agreed service quality to its customers. A service assurance system for telecom services must be able to detect and preferably also predict problems that may violate the agreed service quality. This is a complex task already in legacy systems and will become even more challenging when executing the services in the cloud. Further, the service assurance system must be able to remedy, in real time, these problems once detected. One promising approach to service assurance is based on machine learning, where the service quality and behavior is learned from observations of the system. The ambition is to do automated real-time predictions of the service quality in order to execute mitigation actions in a proactive manner. Machine learning has been used in the past to build prediction models for service quality assurance.
Nonparametric Independence Testing for Small Sample Sizes
This paper deals with the problem of nonparametric independence testing, a fundamental decision-theoretic problem that asks if two arbitrary (possibly multivariate) random variables $X,Y$ are independent or not, a question that comes up in many fields like causality and neuroscience. While quantities like correlation of $X,Y$ only test for (univariate) linear independence, natural alternatives like mutual information of $X,Y$ are hard to estimate due to a serious curse of dimensionality. A recent approach, avoiding both issues, estimates norms of an \textit{operator} in Reproducing Kernel Hilbert Spaces (RKHSs). Our main contribution is strong empirical evidence that by employing \textit{shrunk} operators when the sample size is small, one can attain an improvement in power at low false positive rates. We analyze the effects of Stein shrinkage on a popular test statistic called HSIC (Hilbert-Schmidt Independence Criterion). Our observations provide insights into two recently proposed shrinkage estimators, SCOSE and FCOSE - we prove that SCOSE is (essentially) the optimal linear shrinkage method for \textit{estimating} the true operator; however, the non-linearly shrunk FCOSE usually achieves greater improvements in \textit{test power}. This work is important for more powerful nonparametric detection of subtle nonlinear dependencies for small samples.
Encrypted statistical machine learning: new privacy preserving methods
Aslett, Louis J. M., Esperança, Pedro M., Holmes, Chris C.
We present two new statistical machine learning methods designed to learn on fully homomorphic encrypted (FHE) data. The introduction of FHE schemes following Gentry (2009) opens up the prospect of privacy preserving statistical machine learning analysis and modelling of encrypted data without compromising security constraints. We propose tailored algorithms for applying extremely random forests, involving a new cryptographic stochastic fraction estimator, and na\"{i}ve Bayes, involving a semi-parametric model for the class decision boundary, and show how they can be used to learn and predict from encrypted data. We demonstrate that these techniques perform competitively on a variety of classification data sets and provide detailed information about the computational practicalities of these and other FHE methods.
AUC Optimisation and Collaborative Filtering
Dhanjal, Charanpal, Gaudel, Romaric, Clemencon, Stephan
In recommendation systems, one is interested in the ranking of the predicted items as opposed to other losses such as the mean squared error. Although a variety of ways to evaluate rankings exist in the literature, here we focus on the Area Under the ROC Curve (AUC) as it widely used and has a strong theoretical underpinning. In practical recommendation, only items at the top of the ranked list are presented to the users. With this in mind, we propose a class of objective functions over matrix factorisations which primarily represent a smooth surrogate for the real AUC, and in a special case we show how to prioritise the top of the list. The objectives are differentiable and optimised through a carefully designed stochastic gradient-descent-based algorithm which scales linearly with the size of the data. In the special case of square loss we show how to improve computational complexity by leveraging previously computed measures. To understand theoretically the underlying matrix factorisation approaches we study both the consistency of the loss functions with respect to AUC, and generalisation using Rademacher theory. The resulting generalisation analysis gives strong motivation for the optimisation under study. Finally, we provide computation results as to the efficacy of the proposed method using synthetic and real data.