 Accuracy


A Hierarchical Graphical Model for Record Linkage

arXiv.org Machine Learning

The task of matching co-referent records is known, among other names, as record linkage. For large record-linkage problems, there is often little or no labeled data available, but the unlabeled data shows a reasonably clear structure. For such problems, unsupervised or semi-supervised methods are preferable to supervised methods. In this paper, we describe a hierarchical graphical model framework for the linkage problem in an unsupervised setting. In addition to proposing new methods, we also cast existing unsupervised probabilistic record-linkage methods in this framework. Some of the techniques we propose to minimize overfitting in the above model are of interest in the general graphical model setting. We describe a method for incorporating monotonicity constraints in a graphical model. We also outline a bootstrapping approach that uses "single-field" classifiers to noisily label latent variables in a hierarchical model. Experimental results show that our proposed unsupervised methods are competitive even with fully supervised record-linkage methods.


Biogeography-Based Informative Gene Selection and Cancer Classification Using SVM and Random Forests

arXiv.org Machine Learning

Microarray cancer gene expression data are very high-dimensional. Reducing the dimensionality helps improve the overall analysis and classification performance. We propose two hybrid techniques, Biogeography-Based Optimization with Random Forests (BBO-RF) and with Support Vector Machines (BBO-SVM), using gene ranking as a heuristic, for microarray gene expression analysis. This heuristic is obtained from an information gain filter ranking procedure. The BBO algorithm generates a population of candidate subsets of genes, as part of an ecosystem of habitats, and employs the migration and mutation processes across multiple generations of the population to improve the classification accuracy. The fitness of each gene subset is assessed by the classifiers, SVM and Random Forests. The performance of these hybrid techniques is evaluated on three cancer gene expression datasets retrieved from the Kent Ridge Biomedical datasets collection and the libSVM data repository. Our results demonstrate that genes selected by the proposed techniques yield classification accuracies comparable to previously reported algorithms.
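
The information gain filter mentioned above can be illustrated with a minimal sketch. It assumes expression levels have already been discretized (for example into "high" and "low"); the function names and the toy data are hypothetical and not taken from the paper, and the BBO search itself is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_genes(expression, labels):
    """Sort gene names by information gain, most informative first.

    expression: dict mapping a gene name to its discretized levels,
    one level per sample (hypothetical input layout).
    """
    return sorted(
        expression,
        key=lambda gene: information_gain(expression[gene], labels),
        reverse=True,
    )


labels = ["tumor", "tumor", "normal", "normal"]
expression = {
    "gene_b": ["high", "low", "high", "low"],   # unrelated to the label
    "gene_a": ["high", "high", "low", "low"],   # separates the classes
}
ranking = rank_genes(expression, labels)
```

A ranking of this kind could then seed or bias the candidate gene subsets that a search procedure evaluates with a classifier.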


An Introduction to Artificial Prediction Markets for Classification

arXiv.org Machine Learning

Prediction markets are used in real life to predict outcomes of interest such as presidential elections. This paper presents a mathematical theory of artificial prediction markets for supervised learning of conditional probability estimators. The artificial prediction market is a novel method for fusing the prediction information of features or trained classifiers, where the fusion result is the contract price on the possible outcomes. The market can be trained online by updating the participants' budgets using training examples. Inspired by the real prediction markets, the equations that govern the market are derived from simple and reasonable assumptions. Efficient numerical algorithms are presented for solving these equations. The obtained artificial prediction market is shown to be a maximum likelihood estimator. It generalizes linear aggregation, existent in boosting and random forest, as well as logistic regression and some kernel methods. Furthermore, the market mechanism allows the aggregation of specialized classifiers that participate only on specific instances. Experimental comparisons show that the artificial prediction markets often outperform random forest and implicit online learning on synthetic data and real UCI datasets. Moreover, an extensive evaluation for pelvic and abdominal lymph node detection in CT data shows that the prediction market improves adaboost's detection rate from 79.6% to 81.2% at 3 false positives/volume.


PAC-Bayesian Majority Vote for Late Classifier Fusion

arXiv.org Machine Learning

A lot of attention has been devoted to multimedia indexing over the past few years. In the literature, two kinds of fusion schemes are commonly considered: early fusion and late fusion. In this paper we focus on late classifier fusion, where one combines the scores of each modality at the decision level. To tackle this problem, we investigate a recent, elegant, and well-founded quadratic program named MinCq, which comes from the PAC-Bayes theory of machine learning. MinCq looks for the weighted combination, over a set of real-valued functions seen as voters, leading to the lowest misclassification rate, while making use of the voters' diversity. We provide evidence that this method is naturally adapted to late fusion procedures. We propose an extension of MinCq that adds an order-preserving pairwise loss for ranking, which helps improve the Mean Average Precision measure. We confirm the good behavior of the MinCq-based fusion approaches with experiments on a real image benchmark.


A Machine Learning Approach to the Detection of Fetal Hypoxia during Labor and Delivery

AI Magazine

Labor monitoring is crucial in modern health care, as it can be used to detect (and help avoid) significant problems with the fetus. In this article we focus on detecting hypoxia (or oxygen deprivation), a very serious condition that can arise from different pathologies and can lead to life-long disability and death. We present a novel approach to hypoxia detection based on recordings of the uterine pressure and fetal heart rate, which are obtained using standard labor monitoring devices. The key idea is to learn models of the fetal response to signals from its environment. Then, we use the parameters of these models as attributes in a binary classification problem. A running count of pathological classifications over several time periods is taken to provide the current label for the fetus. We use a unique database of real clinical recordings, both from normal and pathological cases. Our approach correctly classifies more than half of the pathological cases 1.5 hours before delivery. These are cases that were missed by clinicians; early detection of this type would have allowed the physician to perform a Caesarean section, possibly avoiding the negative outcome.
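
The running-count labeling step can be sketched as follows. The window length and the count threshold are hypothetical parameters chosen only for illustration; the abstract does not state the values or the exact rule used.

```python
from collections import deque


def running_labels(window_predictions, history=6, min_pathological=4):
    """Label each time period from a running count of recent classifications.

    window_predictions: per-period binary outputs of a classifier
    (1 = pathological). `history` and `min_pathological` are assumed
    values, not taken from the article.
    """
    recent = deque(maxlen=history)  # keeps only the last `history` outputs
    labels = []
    for is_pathological in window_predictions:
        recent.append(bool(is_pathological))
        count = sum(recent)
        labels.append("pathological" if count >= min_pathological else "normal")
    return labels


predictions = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
labels = running_labels(predictions)
```

Counting over several periods, instead of reacting to a single classification, trades some detection delay for robustness to isolated misclassifications.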


Machine Learning and Sensor Fusion for Estimating Continuous Energy Expenditure

AI Magazine

In this article we provide insight into the BodyMedia FIT armband system, a wearable multi-sensor technology that continuously monitors physiological events related to energy expenditure for weight management using machine learning and data modeling methods. Since becoming commercially available in 2001, more than half a million users have used the system to track their physiological parameters and to achieve their individual health goals including weight-loss. We describe several challenges that arise in applying machine learning techniques to the health care domain and present various solutions utilized in the armband system. We demonstrate how machine learning and multi-sensor data fusion techniques are critical to the system's success.


Surrogate Regret Bounds for Bipartite Ranking via Strongly Proper Losses

arXiv.org Machine Learning

The problem of bipartite ranking, where instances are labeled positive or negative and the goal is to learn a scoring function that minimizes the probability of mis-ranking a pair of positive and negative instances (or equivalently, that maximizes the area under the ROC curve), has been widely studied in recent years. A dominant theoretical and algorithmic framework for the problem has been to reduce bipartite ranking to pairwise classification; in particular, it is well known that the bipartite ranking regret can be formulated as a pairwise classification regret, which in turn can be upper bounded using usual regret bounds for classification problems. Recently, Kotlowski et al. (2011) showed regret bounds for bipartite ranking in terms of the regret associated with balanced versions of the standard (non-pairwise) logistic and exponential losses. In this paper, we show that such (non-pairwise) surrogate regret bounds for bipartite ranking can be obtained in terms of a broad class of proper (composite) losses that we term as strongly proper. Our proof technique is much simpler than that of Kotlowski et al. (2011), and relies on properties of proper (composite) losses as elucidated recently by Reid and Williamson (2010, 2011) and others. Our result yields explicit surrogate bounds (with no hidden balancing terms) in terms of a variety of strongly proper losses, including for example logistic, exponential, squared and squared hinge losses as special cases. We also obtain tighter surrogate bounds under certain low-noise conditions via a recent result of Clemencon and Robbiano (2011).
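
The equivalence between the mis-ranking probability and the area under the ROC curve can be made concrete with a small sketch. The function name and the toy scores are illustrative only; ties are counted as half an error, a common convention that is assumed here.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels use 1 for positive and 0 for negative; a tied pair counts 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative instance")
    correct = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
auc = pairwise_auc(scores, labels)
misranking_rate = 1.0 - auc  # empirical probability of mis-ranking a pair
```

In this toy data three of the four positive-negative pairs are ordered correctly, so the empirical AUC is 0.75 and the mis-ranking rate is 0.25.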


Infinite Hidden Relational Models

arXiv.org Artificial Intelligence

In many cases it makes sense to model a relationship symmetrically, not implying any particular directionality. Consider the classical example of a recommendation system where the rating of an item by a user should symmetrically be dependent on the attributes of both the user and the item. The attributes of the (known) relationships are also relevant for predicting attributes of entities and for predicting attributes of new relations. In recommendation systems, the exploitation of relational attributes is often referred to as collaborative filtering. Again, in many applications one might prefer to model the collaborative effect in a symmetrical way. In this paper we present a relational model, which is completely symmetrical. The key innovation is that we introduce for each entity (or object) an infinite-dimensional latent variable as part of a Dirichlet process (DP) model. We discuss inference in the model, which is based on a DP Gibbs sampler, i.e., the Chinese restaurant process. We extend the Chinese restaurant process to be applicable to relational modeling. Our approach is evaluated in three applications. One is a recommendation system based on the MovieLens data set. The second application concerns the prediction of the function of yeast genes/proteins on the data set of KDD Cup 2001 using a multi-relational model. The third application involves a relational medical domain. The experimental results show that our model gives significantly improved estimates of attributes describing relationships or entities in complex relational models.
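
For readers unfamiliar with the Chinese restaurant process, the following is a minimal sketch of the plain (non-relational) process as a prior over cluster assignments. It is not the relational extension or the Gibbs sampler described in the paper; the function name, the concentration value, and the seed are assumptions for illustration.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample cluster assignments from a CRP with concentration alpha > 0.

    With i customers already seated, the next customer joins an existing
    table with probability proportional to its size, or opens a new table
    with probability proportional to alpha.
    """
    rng = random.Random(seed)
    table_sizes = []
    assignments = []
    for _ in range(n_customers):
        weights = table_sizes + [alpha]  # last slot stands for a new table
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[table] += 1
        assignments.append(table)
    return assignments, table_sizes


assignments, table_sizes = chinese_restaurant_process(20, alpha=1.0)
```

The number of occupied tables is not fixed in advance, which is what lets a DP-based model choose the number of latent states per entity from the data.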


The Nonparanormal SKEPTIC

arXiv.org Machine Learning

We propose a semiparametric approach, named nonparanormal skeptic, for estimating high dimensional undirected graphical models. In terms of modeling, we consider the nonparanormal family proposed by Liu et al. (2009). In terms of estimation, we exploit nonparametric rank-based correlation coefficient estimators, including Spearman's rho and Kendall's tau. In high dimensional settings, we prove that the nonparanormal skeptic achieves the optimal parametric rate of convergence in both graph and parameter estimation. This result suggests that the nonparanormal graphical models are a safe replacement for the Gaussian graphical models, even when the data are Gaussian.
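
A rough sketch of rank-based correlation estimation is given below. The sine transforms are the standard maps from Spearman's rho and Kendall's tau to the latent Pearson correlation under a Gaussian copula; the abstract does not spell them out, so their use here should be read as an assumption. The helper names are hypothetical, ties are assumed absent, and the downstream graph estimation step is not shown.

```python
import math


def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via the squared rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            total += (product > 0) - (product < 0)
    return total / (n * (n - 1) / 2)


def correlation_from_rho(rho):
    """Map Spearman's rho to a latent correlation estimate."""
    return 2.0 * math.sin(math.pi / 6.0 * rho)


def correlation_from_tau(tau):
    """Map Kendall's tau to a latent correlation estimate."""
    return math.sin(math.pi / 2.0 * tau)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [math.exp(v) for v in x]  # monotone but strongly non-linear in x
estimate_rho = correlation_from_rho(spearman_rho(x, y))
estimate_tau = correlation_from_tau(kendall_tau(x, y))
```

Because both statistics depend only on ranks, the estimates are unchanged by monotone transformations of each variable, which is the property that makes them suitable for the nonparanormal family.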


An Efficient Approach to Sparse Linear Discriminant Analysis

arXiv.org Machine Learning

We present a novel approach to the formulation and the resolution of sparse Linear Discriminant Analysis (LDA). Our proposal is based on penalized Optimal Scoring. It has an exact equivalence with penalized LDA, contrary to the multi-class approaches based on the regression of class indicators that have been proposed so far. Sparsity is obtained thanks to a group-Lasso penalty that selects the same features in all discriminant directions. Our experiments demonstrate that this approach generates extremely parsimonious models without compromising prediction performance. Besides prediction, the resulting sparse discriminant directions are also amenable to low-dimensional representations of data. Our algorithm is highly efficient for medium to large numbers of variables, and is thus particularly well suited to the analysis of gene expression data.
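
To see why a group-Lasso penalty selects the same features in all discriminant directions, consider its proximal operator applied to a coefficient matrix with one row per feature and one column per direction. The sketch below is a generic building block, not the algorithm of the paper; the function name and the numbers are illustrative assumptions.

```python
import math


def group_soft_threshold(rows, threshold):
    """Shrink each row (one feature across all directions) toward zero.

    A row whose Euclidean norm is at most `threshold` is zeroed entirely,
    so the feature is dropped from every direction at once.
    """
    shrunk = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk.append([scale * v for v in row])
    return shrunk


coefficients = [
    [3.0, 4.0],   # strong feature: norm 5, kept and shrunk
    [0.3, 0.4],   # weak feature: norm 0.5, removed from both directions
]
result = group_soft_threshold(coefficients, threshold=1.0)
```

The penalty acts on whole rows, so sparsity is shared across directions instead of being decided coefficient by coefficient.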