AITopics | Performance Analysis

Sparse Fisher's Linear Discriminant Analysis for Partially Labeled Data

arXiv.org Machine Learning, Sep-17-2015

Classification is an important tool with many useful applications. Among the many classification methods, Fisher's Linear Discriminant Analysis (LDA) is a traditional model-based approach that makes use of covariance information. However, in the high-dimensional, low-sample-size setting, LDA cannot be deployed directly because the sample covariance matrix is not invertible. While there are modern methods designed to deal with high-dimensional data, they may not fully use the covariance information as LDA does. Hence in some situations it is still desirable to use a model-based method such as LDA for classification. This article exploits the potential of LDA in more complicated data settings. In many real applications, it is costly to label observations manually, so often only a small portion of the data is labeled while a large number of observations are left without a label. It is a great challenge to obtain good classification performance from the labeled data alone, especially when the dimension is greater than the number of labeled observations. To overcome this issue, we propose a semi-supervised sparse LDA classifier that takes advantage of the seemingly useless unlabeled data, which provide additional information that helps to boost classification performance in some situations. A direct estimation method is used to reconstruct LDA and achieve sparsity; meanwhile, we employ the difference-convex algorithm to handle the non-convex loss function associated with the unlabeled data. Theoretical properties of the proposed classifier are studied. Our simulated examples help to show when and how the information extracted from the unlabeled data can be useful. A real data example further illustrates the usefulness of the proposed method.

artificial intelligence, machine learning, unlabeled data, (18 more...)

1509.05438

Country: North America > United States > New York (0.14)

Genre: Research Report > Experimental Study (0.67)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Discriminant Analysis (0.61)

A Practioner's Guide to Evaluating Entity Resolution Results

Barnes, Matt

arXiv.org Machine Learning, Sep-14-2015

Entity resolution (ER) is the task of identifying records belonging to the same entity (e.g. individual, group) across one or multiple databases. Ironically, it has multiple names: deduplication and record linkage, among others. In this paper we survey metrics used to evaluate ER results in order to iteratively improve performance and guarantee sufficient quality prior to deployment. Some of these metrics are borrowed from the multi-class classification and clustering domains, though key differences distinguish entity resolution from general clustering. Menestrina et al. showed empirically that rankings from these metrics often conflict with each other, which is our primary motivation for studying them. This paper provides practitioners with the basic knowledge to begin evaluating their entity resolution results.
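As a rough illustration of what evaluating an ER result can look like, the sketch below computes pairwise precision, recall, and F1 between a predicted clustering and a ground-truth clustering. The abstract does not name the specific metrics it surveys, so treating pairwise scores as a representative example is an assumption, and the function names and record identifiers are hypothetical.

```python
from itertools import combinations

def pair_set(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs

def pairwise_scores(predicted, truth):
    """Pairwise precision, recall, and F1 of a predicted clustering."""
    pred_pairs = pair_set(predicted)
    true_pairs = pair_set(truth)
    tp = len(pred_pairs & true_pairs)  # pairs linked in both clusterings
    precision = tp / len(pred_pairs) if pred_pairs else 1.0
    recall = tp / len(true_pairs) if true_pairs else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical records: r1-r3 are one entity, r4-r5 another.
truth = [{"r1", "r2", "r3"}, {"r4", "r5"}]
predicted = [{"r1", "r2"}, {"r3", "r4", "r5"}]

precision, recall, f1 = pairwise_scores(predicted, truth)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

Here the prediction links four pairs, two of which are correct, and recovers two of the four true pairs, so all three scores are 0.5. A different metric could rank the same two clusterings differently, which is the kind of conflict the abstract refers to.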

information retrieval, machine learning, natural language, (14 more...)

1509.04238

Country: North America > United States (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)

Markov Boundary Discovery with Ridge Regularized Linear Models

Strobl, Eric V., Visweswaran, Shyam

arXiv.org Machine Learning, Sep-13-2015

Ridge regularized linear models (RRLMs), such as ridge regression and the SVM, are a popular group of methods that are used in conjunction with coefficient hypothesis testing to discover explanatory variables with a significant multivariate association to a response. However, many investigators are reluctant to draw causal interpretations of the selected variables due to the incomplete knowledge of the capabilities of RRLMs in causal inference. Under reasonable assumptions, we show that a modified form of RRLMs can get very close to identifying a subset of the Markov boundary by providing a worst-case bound on the space of possible solutions. The results hold for any convex loss, even when the underlying functional relationship is nonlinear, and the solution is not unique. Our approach combines ideas in Markov boundary and sufficient dimension reduction theory. Experimental results show that the modified RRLMs are competitive against state-of-the-art algorithms in discovering part of the Markov boundary from gene expression data.

artificial intelligence, machine learning, markov boundary, (19 more...)

1509.03935

Country: North America > United States > California (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.94)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
(2 more...)

A More Powerful Two-Sample Test in High Dimensions using Random Projection

Lopes, Miles E., Jacob, Laurent J., Wainwright, Martin J.

arXiv.org Machine Learning, Sep-13-2015

We consider the hypothesis testing problem of detecting a shift between the means of two multivariate normal distributions in the high-dimensional setting, allowing for the data dimension p to exceed the sample size n. Specifically, we propose a new test statistic for the two-sample test of means that integrates a random projection with the classical Hotelling T^2 statistic. Working under a high-dimensional framework with (p,n) tending to infinity, we first derive an asymptotic power function for our test, and then provide sufficient conditions for it to achieve greater power than other state-of-the-art tests. Using ROC curves generated from synthetic data, we demonstrate superior performance against competing tests in the parameter regimes anticipated by our theoretical results. Lastly, we illustrate an advantage of our procedure's false positive rate with comparisons on high-dimensional gene expression data involving the discrimination of different types of cancer.
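The general idea of combining a random projection with the Hotelling T^2 statistic can be sketched as follows: project both samples from p dimensions down to k dimensions with a random Gaussian matrix, then compute the classical two-sample T^2 on the projected data, where the pooled covariance is invertible. This is a minimal sketch under assumptions; the abstract does not give the paper's choice of k, its projection distribution, or its normalization, so the Gaussian projection, the value k = 5, and all function names here are illustrative only.

```python
import random

def project(rows, proj):
    """Map each p-dimensional row to k dimensions with the k x p matrix proj."""
    return [[sum(w * x for w, x in zip(prow, row)) for prow in proj] for row in rows]

def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def pooled_covariance(xs, ys):
    """Pooled sample covariance of two samples with a common dimension."""
    k = len(xs[0])
    cov = [[0.0] * k for _ in range(k)]
    for rows in (xs, ys):
        mean = mean_vector(rows)
        for row in rows:
            d = [v - m for v, m in zip(row, mean)]
            for i in range(k):
                for j in range(k):
                    cov[i][j] += d[i] * d[j]
    dof = len(xs) + len(ys) - 2
    return [[c / dof for c in r] for r in cov]

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z

def projected_hotelling_t2(xs, ys, k, rng):
    """Hotelling T^2 computed after a random Gaussian projection to k dims."""
    p = len(xs[0])
    proj = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(k)]
    px, py = project(xs, proj), project(ys, proj)
    diff = [a - b for a, b in zip(mean_vector(px), mean_vector(py))]
    z = solve(pooled_covariance(px, py), diff)  # S^{-1} (mean_x - mean_y)
    n1, n2 = len(xs), len(ys)
    return (n1 * n2 / (n1 + n2)) * sum(d * zi for d, zi in zip(diff, z))

# Synthetic example with p = 50 > n1 + n2 = 20, so the unprojected
# pooled covariance would be singular.
rng = random.Random(0)
p, n1, n2, k = 50, 10, 10, 5
xs = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n1)]
ys = [[rng.gauss(1.0, 1.0) for _ in range(p)] for _ in range(n2)]
t2 = projected_hotelling_t2(xs, ys, k, rng)
print(t2)
```

Turning the statistic into a test would also require a reference distribution or threshold, which this sketch leaves out.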

artificial intelligence, machine learning, procedure, (17 more...)

1108.2401

Country: North America > United States (0.67)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Supervised Collective Classification for Crowdsourcing

Chen, Pin-Yu, Lien, Chia-Wei, Chu, Fu-Jen, Ting, Pai-Shun, Cheng, Shin-Ming

arXiv.org Machine Learning, Sep-7-2015

Crowdsourcing utilizes the wisdom of crowds for collective classification via information (e.g., labels of an item) provided by labelers. Current crowdsourcing algorithms are mainly unsupervised methods that are unaware of the quality of crowdsourced data. In this paper, we propose a supervised collective classification algorithm that aims to identify reliable labelers from the training data (e.g., items with known labels). The reliability (i.e., weighting factor) of each labeler is determined via a saddle point algorithm. Results on several crowdsourced data sets show that supervised methods can achieve better classification accuracy than unsupervised methods, and that our proposed method outperforms other algorithms.
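To make the idea of per-labeler weighting factors concrete, the sketch below estimates each labeler's reliability from training items with known labels and then takes a weighted vote on a new item. It deliberately swaps in a simple log-odds-of-accuracy weight rather than the saddle point algorithm mentioned in the abstract, whose details are not given here; the labeler names, the binary labels in {+1, -1}, and the clipping constant are all assumptions.

```python
import math

def labeler_weights(train_votes, train_truth, eps=1e-3):
    """Log-odds weight per labeler from accuracy on items with known labels."""
    weights = {}
    for labeler, votes in train_votes.items():
        correct = sum(v == t for v, t in zip(votes, train_truth))
        acc = correct / len(train_truth)
        acc = min(max(acc, eps), 1.0 - eps)  # clip to keep the log finite
        weights[labeler] = math.log(acc / (1.0 - acc))
    return weights

def weighted_vote(votes, weights):
    """Sign of the reliability-weighted sum of +1/-1 votes."""
    score = sum(weights[labeler] * v for labeler, v in votes.items())
    return 1 if score >= 0 else -1

def majority_vote(votes):
    """Unweighted baseline that treats every labeler equally."""
    return 1 if sum(votes.values()) >= 0 else -1

# Hypothetical training data: four items with known labels.
train_truth = [1, -1, 1, 1]
train_votes = {
    "A": [1, -1, 1, 1],     # always right
    "B": [-1, -1, 1, -1],   # right half the time
    "C": [-1, 1, -1, -1],   # always wrong
}
weights = labeler_weights(train_votes, train_truth)

# A new item on which the unreliable labelers outvote the reliable one.
new_votes = {"A": 1, "B": -1, "C": -1}
print(majority_vote(new_votes))           # -1
print(weighted_vote(new_votes, weights))  # 1
```

Labeler C receives a negative weight, so its vote is effectively inverted, and labeler B, who is at chance level, receives weight zero.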

artificial intelligence, labeler, machine learning, (18 more...)

doi: 10.1109/GLOCOMW.2015.7414077

1507.06682

Country: North America > United States (0.68)

Genre: Research Report (0.64)

Industry: Education (0.93)

Technology:

Information Technology > Communications > Social Media > Crowdsourcing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.47)

On Graphical Models via Univariate Exponential Family Distributions

Yang, Eunho, Ravikumar, Pradeep, Allen, Genevera I., Liu, Zhandong

arXiv.org Machine Learning, Sep-5-2015

Undirected graphical models, or Markov networks, are a popular class of statistical models, used in a wide variety of applications. Popular instances of this class include Gaussian graphical models and Ising models. In many settings, however, it might not be clear which subclass of graphical models to use, particularly for non-Gaussian and non-categorical data. In this paper, we consider a general sub-class of graphical models where the node-wise conditional distributions arise from exponential families. This allows us to derive multivariate graphical model distributions from univariate exponential family distributions, such as the Poisson, negative binomial, and exponential distributions. Our key contributions include a class of M-estimators to fit these graphical model distributions; and rigorous statistical analysis showing that these M-estimators recover the true graphical model structure exactly, with high probability. We provide examples of genomic and proteomic networks learned via instances of our class of graphical models derived from Poisson and exponential distributions.

artificial intelligence, graphical model, machine learning, (14 more...)

1301.4183

Country: North America > United States > Texas (0.46)

Genre:

Research Report > New Finding (0.45)
Research Report > Experimental Study (0.45)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.68)

Technology:

Information Technology > Artificial Intelligence > Systems & Languages (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Predicting SLA Violations in Real Time using Online Machine Learning

Ahmed, Jawwad, Johnsson, Andreas, Yanggratoke, Rerngvit, Ardelius, John, Flinta, Christofer, Stadler, Rolf

arXiv.org Machine Learning, Sep-4-2015

Next generation telecom services will execute on the telecom cloud, which combines the flexibility of today's computing clouds with the service quality of telecom systems. Real-time service assurance will become an integral part of transforming the general and flexible cloud into a robust and highly available cloud that can ensure low latency and agreed service quality to its customers. A service assurance system for telecom services must be able to detect and preferably also predict problems that may violate the agreed service quality. This is already a complex task in legacy systems and will become even more challenging when executing the services in the cloud. Further, the service assurance system must be able to remedy, in real time, these problems once detected. One promising approach to service assurance is based on machine learning, where the service quality and behavior are learned from observations of the system. The ambition is to make automated real-time predictions of the service quality in order to execute mitigation actions in a proactive manner. Machine learning has been used in the past to build prediction models for service quality assurance.

artificial intelligence, load trace, machine learning, (17 more...)

1509.01386

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.69)

Industry:

Telecommunications (0.86)
Education > Educational Setting > Online (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)

Nonparametric Independence Testing for Small Sample Sizes

Ramdas, Aaditya, Wehbe, Leila

arXiv.org Machine Learning, Sep-2-2015

This paper deals with the problem of nonparametric independence testing, a fundamental decision-theoretic problem that asks if two arbitrary (possibly multivariate) random variables X and Y are independent or not, a question that comes up in many fields like causality and neuroscience. While quantities like correlation of X and Y only test for (univariate) linear independence, natural alternatives like mutual information of X and Y are hard to estimate due to a serious curse of dimensionality. A recent approach, avoiding both issues, estimates norms of an operator in Reproducing Kernel Hilbert Spaces (RKHSs). Our main contribution is strong empirical evidence that by employing shrunk operators when the sample size is small, one can attain an improvement in power at low false positive rates. We analyze the effects of Stein shrinkage on a popular test statistic called HSIC (Hilbert-Schmidt Independence Criterion). Our observations provide insights into two recently proposed shrinkage estimators, SCOSE and FCOSE - we prove that SCOSE is (essentially) the optimal linear shrinkage method for estimating the true operator; however, the non-linearly shrunk FCOSE usually achieves greater improvements in test power. This work is important for more powerful nonparametric detection of subtle nonlinear dependencies for small samples.
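For readers unfamiliar with HSIC, the sketch below computes the plain (unshrunk) biased empirical statistic for scalar samples: build Gaussian Gram matrices K and L for the two variables, double-center them, and average the elementwise product, which equals trace(K H L H) / n^2 with H = I - 11^T / n. It does not implement the SCOSE or FCOSE shrinkage estimators discussed in the abstract. The Gaussian kernel, the bandwidth of 1.0, the 1/n^2 normalization, and the function names are assumptions made for illustration.

```python
import math

def gaussian_gram(values, sigma=1.0):
    """Gram matrix K[i][j] = exp(-(v_i - v_j)^2 / (2 sigma^2)) for scalars."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
            for a in values]

def center(gram):
    """Double-center a symmetric Gram matrix, i.e. compute H K H."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2."""
    n = len(xs)
    kc = center(gaussian_gram(xs, sigma))
    lc = center(gaussian_gram(ys, sigma))
    return sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n)) / (n * n)

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
dependent = [x * x for x in xs]   # nonlinear function of xs
constant = [1.0] * len(xs)        # carries no information about xs

print(hsic(xs, dependent))  # strictly positive
print(hsic(xs, constant))   # 0.0
```

A constant variable has an all-ones Gram matrix, which centers to zero, so its HSIC with anything is exactly zero. Deciding independence in practice also needs a null distribution, for example from permutations, which is omitted here.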

artificial intelligence, estimator, machine learning, (17 more...)

1406.1922

Country: North America > United States (0.68)

Genre: Research Report (1.00)

Industry:

Government > Regional Government (0.46)
Health & Medicine > Therapeutic Area > Neurology (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Encrypted statistical machine learning: new privacy preserving methods

Aslett, Louis J. M., Esperança, Pedro M., Holmes, Chris C.

arXiv.org Machine Learning, Aug-27-2015

We present two new statistical machine learning methods designed to learn on fully homomorphic encrypted (FHE) data. The introduction of FHE schemes following Gentry (2009) opens up the prospect of privacy preserving statistical machine learning analysis and modelling of encrypted data without compromising security constraints. We propose tailored algorithms for applying extremely random forests, involving a new cryptographic stochastic fraction estimator, and naïve Bayes, involving a semi-parametric model for the class decision boundary, and show how they can be used to learn and predict from encrypted data. We demonstrate that these techniques perform competitively on a variety of classification data sets and provide detailed information about the computational practicalities of these and other FHE methods.

artificial intelligence, bayesian inference, machine learning, (18 more...)

1508.06845

Genre: Research Report (0.65)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)

AUC Optimisation and Collaborative Filtering

Dhanjal, Charanpal, Gaudel, Romaric, Clemencon, Stephan

arXiv.org Machine Learning, Aug-25-2015

In recommendation systems, one is interested in the ranking of the predicted items as opposed to other losses such as the mean squared error. Although a variety of ways to evaluate rankings exist in the literature, here we focus on the Area Under the ROC Curve (AUC) as it is widely used and has a strong theoretical underpinning. In practical recommendation, only items at the top of the ranked list are presented to the users. With this in mind, we propose a class of objective functions over matrix factorisations which primarily represent a smooth surrogate for the real AUC, and in a special case we show how to prioritise the top of the list. The objectives are differentiable and optimised through a carefully designed stochastic gradient-descent-based algorithm which scales linearly with the size of the data. In the special case of square loss we show how to improve computational complexity by leveraging previously computed measures. To understand the underlying matrix factorisation approaches theoretically, we study both the consistency of the loss functions with respect to AUC, and generalisation using Rademacher theory. The resulting generalisation analysis gives strong motivation for the optimisation under study. Finally, we provide computational results on the efficacy of the proposed method using synthetic and real data.
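The relationship between the AUC and a smooth surrogate can be illustrated with a small example. The empirical AUC is the fraction of (relevant, irrelevant) item pairs that the scores order correctly; replacing the 0/1 comparison with a sigmoid of the score difference gives a differentiable approximation. The logistic surrogate, the sharpness parameter beta, and the scores below are assumptions chosen for illustration, since the abstract does not specify the paper's class of objectives.

```python
import math

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    total = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))

def smooth_auc(pos_scores, neg_scores, beta=5.0):
    """Differentiable surrogate: the indicator is replaced by a sigmoid."""
    total = sum(1.0 / (1.0 + math.exp(-beta * (p - q)))
                for p in pos_scores for q in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted scores for one user's relevant and irrelevant items.
pos = [0.9, 0.7, 0.4]
neg = [0.8, 0.3, 0.1]

print(auc(pos, neg))         # 7 of 9 pairs are ordered correctly
print(smooth_auc(pos, neg))  # a value strictly between 0 and 1
```

As beta grows, the surrogate approaches the exact AUC whenever no scores are tied, at the cost of flatter gradients away from the decision boundary.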

artificial intelligence, auc optimisation, machine learning, (14 more...)

1508.06091

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
