Goto

Collaborating Authors

 Performance Analysis


Balancing Lifetime and Classification Accuracy of Wireless Sensor Networks

arXiv.org Machine Learning

Wireless sensor networks are composed of distributed sensors that can be used for signal detection or classification. The likelihood functions of the hypotheses are often not known in advance, and decision rules have to be learned via supervised learning. A specific such algorithm is Fisher discriminant analysis (FDA), the classification accuracy of which has been previously studied in the context of wireless sensor networks. Previous work, however, does not take into account the communication protocol or battery lifetime of the sensor networks; in this paper we extend the existing studies by proposing a model that captures the relationship between battery lifetime and classification accuracy. In order to do so we combine the FDA with a model that captures the dynamics of the Carrier-Sense Multiple-Access (CSMA) algorithm, the random-access algorithm used to regulate communications in sensor networks. This allows us to study the interaction between the classification accuracy, battery lifetime and effort put towards learning, as well as the impact of the back-off rates of CSMA on the accuracy. We characterize the tradeoff between the length of the training stage and accuracy, and show that accuracy is non-monotone in the back-off rate due to changes in the training sample size and overfitting.


Learning a peptide-protein binding affinity predictor with kernel ridge regression

arXiv.org Machine Learning

We propose a specialized string kernel for small bio-molecules, peptides and pseudo-sequences of binding interfaces. The kernel incorporates physico-chemical properties of amino acids and elegantly generalize eight kernels, such as the Oligo, the Weighted Degree, the Blended Spectrum, and the Radial Basis Function. We provide a low complexity dynamic programming algorithm for the exact computation of the kernel and a linear time algorithm for it's approximation. Combined with kernel ridge regression and SupCK, a novel binding pocket kernel, the proposed kernel yields biologically relevant and good prediction accuracy on the PepX database. For the first time, a machine learning predictor is capable of accurately predicting the binding affinity of any peptide to any protein. The method was also applied to both single-target and pan-specific Major Histocompatibility Complex class II benchmark datasets and three Quantitative Structure Affinity Model benchmark datasets. On all benchmarks, our method significantly (p-value < 0.057) outperforms the current state-of-the-art methods at predicting peptide-protein binding affinities. The proposed approach is flexible and can be applied to predict any quantitative biological activity. The method should be of value to a large segment of the research community with the potential to accelerate peptide-based drug and vaccine development.


High Dimensional Semiparametric Gaussian Copula Graphical Models

arXiv.org Machine Learning

In this paper, we propose a semiparametric approach, named nonparanormal skeptic, for efficiently and robustly estimating high dimensional undirected graphical models. To achieve modeling flexibility, we consider Gaussian Copula graphical models (or the nonparanormal) as proposed by Liu et al. (2009). To achieve estimation robustness, we exploit nonparametric rank-based correlation coefficient estimators, including Spearman's rho and Kendall's tau. In high dimensional settings, we prove that the nonparanormal skeptic achieves the optimal parametric rate of convergence in both graph and parameter estimation. This celebrating result suggests that the Gaussian copula graphical models can be used as a safe replacement of the popular Gaussian graphical models, even when the data are truly Gaussian. Besides theoretical analysis, we also conduct thorough numerical simulations to compare different estimators for their graph recovery performance under both ideal and noisy settings. The proposed methods are then applied on a large-scale genomic dataset to illustrate their empirical usefulness. The R language software package huge implementing the proposed methods is available on the Comprehensive R Archive Network: http://cran. r-project.org/.


Identifying Users From Their Rating Patterns

arXiv.org Machine Learning

This paper reports on our analysis of the 2011 CAMRa Challenge dataset (Track 2) for context-aware movie recommendation systems. The train dataset comprises 4,536,891 ratings provided by 171,670 users on 23,974$ movies, as well as the household groupings of a subset of the users. The test dataset comprises 5,450 ratings for which the user label is missing, but the household label is provided. The challenge required to identify the user labels for the ratings in the test set. Our main finding is that temporal information (time labels of the ratings) is significantly more useful for achieving this objective than the user preferences (the actual ratings). Using a model that leverages on this fact, we are able to identify users within a known household with an accuracy of approximately 96% (i.e. misclassification rate around 4%).


Cost-Sensitive Risk Stratification in the Diagnosis of Heart Disease

AAAI Conferences

We investigate machine learning methods for diagnostic screening of heart disease. Coronary heart disease is the leading cause of death in the US, causing more deaths than all types of cancers combined. Early diagnosis of heart disease in women is harder than it is in men and typically requires the administration of several clinical tests on the patient. Most risk stratification methods aggregate the results of such tests, including the risky, invasive procedures that cannot be administered on all patients. In this paper, our goal is to identify patients who are under high-risk of having heart disease and related adverse events, using a minimal number of diagnostic tests, especially less invasive ones. The low frequency of patients with severe heart disease in the dataset is challenging for most conventional machine learning methods. To overcome this problem, we develop and apply a cost-sensitive k nearest neighbor algorithm. Our contributions are two fold: First, we compare the predictive value of several diagnostic procedures for heart disease, including electrocardiography, angiography, radionuclide perfusion and conclude that in womens heart disease, certain combinations of non-invasive techniques are more predictive than some of the widely used invasive procedures. Then, we evaluate held out data and achieve an AUROC over 0.70, signifying valuable clinical utility, using only the least costly and least invasive tests.


Improving Quality of Crowdsourced Labels via Probabilistic Matrix Factorization

AAAI Conferences

In crowdsourced relevance judging, each crowd workertypically judges only a small number of examples,yielding a sparse and imbalanced set of judgments inwhich relatively few workers influence output consensuslabels, particularly with simple consensus methodslike majority voting. We show how probabilistic matrixfactorization, a standard approach in collaborative filtering,can be used to infer missing worker judgments suchthat all workers influence output labels. Given completeworker judgments inferred by PMF, we evaluate impactin unsupervised and supervised scenarios. In thesupervised case, we consider both weighted voting andworker selection strategies based on worker accuracy.Experiments on a synthetic data set and a real turk dataset with crowd judgments from the 2010 TREC RelevanceFeedback Track show promise of the PMF approachmerits further investigation and analysis.


A Robust Planning Framework for Cognitive Robots

AAAI Conferences

A cognitive robot should construct a plan to attain its goals. While it executes the actions in its plan, it may face several failures due to both internal and external issues. We present a taxonomy to classify these failures that may be encountered during the execution of cognitive tasks. The taxonomy presents a wide range of failure types. To recover from most of these failures presented in this taxonomy, we propose a Robust Planning Framework for cognitive robots. Our framework combines planning, reasoning and learning procedures into each other for robust execution of cognitive tasks. Failures can be detected and handled by reasoning and replanning, respectively. The framework also facilitates learning new hypotheses incrementally based on experience. It can successfully detect and recover from temporary failures on a selected set of actions executed by a Pioneer3DX robot. It has been shown that our preliminary results for hypothesis learning in failure scenarios are promising.


Random Projection with Filtering for Nearly Duplicate Search

AAAI Conferences

High dimensional nearest neighbor search is a fundamental problem and has found applications in many domains. Although many hashing based approaches have been proposed for approximate nearest neighbor search in high dimensional space, one main drawback is that they often return many false positives that need to be filtered out by a post procedure. We propose a novel method to address this limitation in this paper. The key idea is to introduce a filtering procedure within the search algorithm, based on the compressed sensing theory, that effectively removes the false positive answers. We first obtain a sparse representation for each data point by the landmark based approach, after which we solve the nearly duplicate search that the difference between the query and its nearest neighbors forms a sparse vector living in a small ℓp ball, where p ≤ 1. Our empirical study on real-world datasets demonstrates the effectiveness of the proposed approach compared to the state-of-the-art hashing methods.


Predictive Mining of Comparable Entities from the Web

AAAI Conferences

Comparing entities is an important part of decision making. Several approaches have been reported for mining comparable entities from Web sources to improve user experience in comparing entities online.However, these efforts extract only entities explicitly compared in the corpora, and may exclude entities that occur less-frequently but potentially comparable. To build a more complete comparison machine that can infer such missing relations, here we develop a solutionto predict transitivity of known comparable relations. Named CliqueGrow, our approach predicts missing links given a comparable entity graph obtained from versus query logs. Our approach achieved the highest F1-score among five link prediction approaches and a commercial comparison engine provided by Yahoo!.


The Impact of Personalization on Smartphone-Based Activity Recognition

AAAI Conferences

Smartphones incorporate many diverse and powerful sensors, which creates exciting new opportunities for data mining and human-computer interaction. In this paper we show how standard classification algorithms can use labeled smartphone-based accelerometer data to identify the physical activity a user is performing. Our main focus is on evaluating the relative performance of impersonal and personal activity recognition models. Our impersonal (i.e., universal) models are built using training data from a panel of users and are then applied to new users, while our personal models are built with data from each user and then applied only to new data from that user. Our results indicate that the personal models perform dramatically better than the impersonal models—even when trained from only a few minutes worth of data. These personal models typically even outperform hybrid models that utilize both personal and impersonal data. These results strongly argue for the construction of personal models whenever possible. Our research means that we can unobtrusively gain useful knowledge about the habits of potentially millions of users. It also means that we can facilitate human computer interaction by enabling the smartphone to consider context and this can lead to new and more effective applications.