Goto

Collaborating Authors

 Performance Analysis


Localized Algorithm of Community Detection on Large-Scale Decentralized Social Networks

arXiv.org Machine Learning

Despite the overwhelming success of the existing Social Networking Services (SNS), their centralized ownership and control have led to serious concerns in user privacy, censorship vulnerability and operational robustness of these services. To overcome these limitations, Distributed Social Networks (DSN) have recently been proposed and implemented. Under these new DSN architectures, no single party possesses the full knowledge of the entire social network. While this approach solves the above problems, the lack of global knowledge for the DSN nodes makes it much more challenging to support some common but critical SNS services like friends discovery and community detection. In this paper, we tackle the problem of community detection for a given user under the constraint of limited local topology information as imposed by common DSN architectures. By considering the Personalized Page Rank (PPR) approach as an ink spilling process, we justify its applicability for decentralized community detection using limited local topology information.Our proposed PPR-based solution has a wide range of applications such as friends recommendation, targeted advertisement, automated social relationship labeling and sybil defense. Using data collected from a large-scale SNS in practice, we demonstrate our adapted version of PPR can significantly outperform the basic PR as well as two other commonly used heuristics. The inclusion of a few manually labeled friends in the Escape Vector (EV) can boost the performance considerably (64.97% relative improvement in terms of Area Under the ROC Curve (AUC)).


Fully scalable online-preprocessing algorithm for short oligonucleotide microarray atlases

arXiv.org Machine Learning

Accumulation of standardized data collections is opening up novel opportunities for holistic characterization of genome function. The limited scalability of current preprocessing techniques has, however, formed a bottleneck for full utilization of contemporary microarray collections. While short oligonucleotide arrays constitute a major source of genome-wide profiling data, scalable probe-level preprocessing algorithms have been available only for few measurement platforms based on pre-calculated model parameters from restricted reference training sets. To overcome these key limitations, we introduce a fully scalable online-learning algorithm that provides tools to process large microarray atlases including tens of thousands of arrays. Unlike the alternatives, the proposed algorithm scales up in linear time with respect to sample size and is readily applicable to all short oligonucleotide platforms. This is the only available preprocessing algorithm that can learn probe-level parameters based on sequential hyperparameter updates at small, consecutive batches of data, thus circumventing the extensive memory requirements of the standard approaches and opening up novel opportunities to take full advantage of contemporary microarray data collections. Moreover, using the most comprehensive data collections to estimate probe-level effects can assist in pinpointing individual probes affected by various biases and provide new tools to guide array design and quality control. The implementation is freely available in R/Bioconductor at http://www.bioconductor.org/packages/devel/bioc/html/RPA.html


Exponentially Weighted Moving Average Charts for Detecting Concept Drift

arXiv.org Machine Learning

Classifying streaming data requires the development of methods which are computationally efficient and able to cope with changes in the underlying distribution of the stream, a phenomenon known in the literature as concept drift. We propose a new method for detecting concept drift which uses an Exponentially Weighted Moving Average (EWMA) chart to monitor the misclassification rate of an streaming classifier. Our approach is modular and can hence be run in parallel with any underlying classifier to provide an additional layer of concept drift detection. Moreover our method is computationally efficient with overhead O(1) and works in a fully online manner with no need to store data points in memory. Unlike many existing approaches to concept drift detection, our method allows the rate of false positive detections to be controlled and kept constant over time.


Automatic post-picking using MAPPOS improves particle image detection from Cryo-EM micrographs

arXiv.org Machine Learning

Cryo-electron microscopy (cryo-EM) studies using single particle reconstruction are extensively used to reveal structural information on macromolecular complexes. Aiming at the highest achievable resolution, state of the art electron microscopes automatically acquire thousands of high-quality micrographs. Particles are detected on and boxed out from each micrograph using fully- or semi-automated approaches. However, the obtained particles still require laborious manual post-picking classification, which is one major bottleneck for single particle analysis of large datasets. We introduce MAPPOS, a supervised post-picking strategy for the classification of boxed particle images, as additional strategy adding to the already efficient automated particle picking routines. MAPPOS employs machine learning techniques to train a robust classifier from a small number of characteristic image features. In order to accurately quantify the performance of MAPPOS we used simulated particle and non-particle images. In addition, we verified our method by applying it to an experimental cryo-EM dataset and comparing the results to the manual classification of the same dataset. Comparisons between MAPPOS and manual post-picking classification by several human experts demonstrated that merely a few hundred sample images are sufficient for MAPPOS to classify an entire dataset with a human-like performance. MAPPOS was shown to greatly accelerate the throughput of large datasets by reducing the manual workload by orders of magnitude while maintaining a reliable identification of non-particle images.


Fast nonparametric classification based on data depth

arXiv.org Machine Learning

A new procedure, called DDa-procedure, is developed to solve the problem of classifying d-dimensional objects into q >= 2 classes. The procedure is completely nonparametric; it uses q-dimensional depth plots and a very efficient algorithm for discrimination analysis in the depth space [0,1]^q. Specifically, the depth is the zonoid depth, and the algorithm is the alpha-procedure. In case of more than two classes several binary classifications are performed and a majority rule is applied. Special treatments are discussed for 'outsiders', that is, data having zero depth vector. The DDa-classifier is applied to simulated as well as real data, and the results are compared with those of similar procedures that have been recently proposed. In most cases the new procedure has comparable error rates, but is much faster than other classification approaches, including the SVM.


Learning with Scope, with Application to Information Extraction and Classification

arXiv.org Machine Learning

In probabilistic approaches to classification and information extraction, one typically builds a statistical model of words under the assumption that future data will exhibit the same regularities as the training data. In many data sets, however, there are scope-limited features whose predictive power is only applicable to a certain subset of the data. For example, in information extraction from web pages, word formatting may be indicative of extraction category in different ways on different web pages. The difficulty with using such features is capturing and exploiting the new regularities encountered in previously unseen data. In this paper, we propose a hierarchical probabilistic model that uses both local/scope-limited features, such as word formatting, and global features, such as word content. The local regularities are modeled as an unobserved random parameter which is drawn once for each local data set. This random parameter is estimated during the inference process and then used to perform classification with both the local and global features--- a procedure which is akin to automatically retuning the classifier to the local regularities on each newly encountered web page. Exact inference is intractable and we present approximations via point estimates and variational methods. Empirical results on large collections of web data demonstrate that this method significantly improves performance from traditional models of global features alone.


An Empirical Comparison of V-fold Penalisation and Cross Validation for Model Selection in Distribution-Free Regression

arXiv.org Machine Learning

Model selection is a crucial issue in machine-learning and a wide variety of penalisation methods (with possibly data dependent complexity penalties) have recently been introduced for this purpose. However their empirical performance is generally not well documented in the literature. It is the goal of this paper to investigate to which extent such recent techniques can be successfully used for the tuning of both the regularisation and kernel parameters in support vector regression (SVR) and the complexity measure in regression trees (CART). This task is traditionally solved via V-fold cross-validation (VFCV), which gives efficient results for a reasonable computational cost. A disadvantage however of VFCV is that the procedure is known to provide an asymptotically suboptimal risk estimate as the number of examples tends to infinity. Recently, a penalisation procedure called V-fold penalisation has been proposed to improve on VFCV, supported by theoretical arguments. Here we report on an extensive set of experiments comparing V-fold penalisation and VFCV for SVR/CART calibration on several benchmark datasets. We highlight cases in which VFCV and V-fold penalisation provide poor estimates of the risk respectively and introduce a modified penalisation technique to reduce the estimation error.


Making Early Predictions of the Accuracy of Machine Learning Applications

arXiv.org Artificial Intelligence

The accuracy of machine learning systems is a widely studied research topic. Established techniques such as cross-validation predict the accuracy on unseen data of the classifier produced by applying a given learning method to a given training data set. However, they do not predict whether incurring the cost of obtaining more data and undergoing further training will lead to higher accuracy. In this paper we investigate techniques for making such early predictions. We note that when a machine learning algorithm is presented with a training set the classifier produced, and hence its error, will depend on the characteristics of the algorithm, on training set's size, and also on its specific composition. In particular we hypothesise that if a number of classifiers are produced, and their observed error is decomposed into bias and variance terms, then although these components may behave differently, their behaviour may be predictable. We test our hypothesis by building models that, given a measurement taken from the classifier created from a limited number of samples, predict the values that would be measured from the classifier produced when the full data set is presented. We create separate models for bias, variance and total error. Our models are built from the results of applying ten different machine learning algorithms to a range of data sets, and tested with "unseen" algorithms and datasets. We analyse the results for various numbers of initial training samples, and total dataset sizes. Results show that our predictions are very highly correlated with the values observed after undertaking the extra training. Finally we consider the more complex case where an ensemble of heterogeneous classifiers is trained, and show how we can accurately estimate an upper bound on the accuracy achievable after further training.


Evaluating Classifiers Without Expert Labels

arXiv.org Machine Learning

Machine Learning manuscript No. (will be inserted by the editor) Abstract This paper considers the challenge of evaluating a set of classifiers, as done in shared task evaluations like the KDD Cup or NIST TREC, without expert labels. While expert labels provide the traditional cornerstone for evaluating statistical learners, limited or expensive access to experts represents a practical bottleneck. Instead, we seek methodology for estimating performance of the classifiers (relative and absolute) which is more scalable than expert labeling yet preserves high correlation with evaluation based on expert labels. We consider both: 1) using only labels automatically generated by the classifiers themselves (blind evaluation); and 2) using labels obtained via crowdsourcing. While crowdsourcing methods are lauded for scalability, using such data for evaluation raises serious concerns given the prevalence of label noise. In regard to blind evaluation, two broad strategies are investigated: combine & score and score & combine. Combine & Score methods infer a single "pseudo-gold" label set by aggregating classifier labels; classifiers are then evaluated based on this single pseudo-gold label set. On the other hand, score & combine methods: i) sample multiple label sets from classifier outputs, ii) evaluate classifiers on each label set, and iii) average classifier performance across label sets. When additional crowd labels are also collected, we investigate two alternative avenues for exploiting them: 1) direct evaluation of classifiers; or 2) supervision of combine-and-score methods. To assess generality of our techniques, classifier performance is measured using four common classification metrics, with statistical significance tests establishing relative performance of the classifiers for each metric. Finally, we measure both score and rank correlations between estimated classifier performance vs. actual performance according to expert judgments. Rigorous evaluation of classifiers from the TREC 2011 Crowdsourcing Track shows reliable evaluation can be achieved without reliance on expert labels.


Network Sampling: From Static to Streaming Graphs

arXiv.org Machine Learning

Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling, by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Our experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph for both static and streaming graphs. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms.