Performance Analysis
Searching for significant patterns in stratified data
Llinares-Lopez, Felipe, Papaxanthos, Laetitia, Bodenham, Dean, Borgwardt, Karsten
Significant pattern mining, the problem of finding itemsets that are significantly enriched in one class of objects, is statistically challenging, as the large space of candidate patterns leads to an enormous multiple testing problem. Recently, the concept of testability was proposed as one approach to correct for multiple testing in pattern mining while retaining statistical power. Still, these strategies based on testability do not allow one to condition the test of significance on the observed covariates, which severely limits its utility in biomedical applications. Here we propose a strategy and an efficient algorithm to perform significant pattern mining in the presence of categorical covariates with K states.
The ABACOC Algorithm: a Novel Approach for Nonparametric Classification of Data Streams
De Rosa, Rocco, Orabona, Francesco, Cesa-Bianchi, Nicolรฒ
Stream mining poses unique challenges to machine learning: predictive models are required to be scalable, incrementally trainable, must remain bounded in size (even when the data stream is arbitrarily long), and be nonparametric in order to achieve high accuracy even in complex and dynamic environments. Moreover, the learning system must be parameterless ---traditional tuning methods are problematic in streaming settings--- and avoid requiring prior knowledge of the number of distinct class labels occurring in the stream. In this paper, we introduce a new algorithmic approach for nonparametric learning in data streams. Our approach addresses all above mentioned challenges by learning a model that covers the input space using simple local classifiers. The distribution of these classifiers dynamically adapts to the local (unknown) complexity of the classification problem, thus achieving a good balance between model complexity and predictive accuracy. We design four variants of our approach of increasing adaptivity. By means of an extensive empirical evaluation against standard nonparametric baselines, we show state-of-the-art results in terms of accuracy versus model size. For the variant that imposes a strict bound on the model size, we show better performance against all other methods measured at the same model size value. Our empirical analysis is complemented by a theoretical performance guarantee which does not rely on any stochastic assumption on the source generating the stream.
Multi-criteria Similarity-based Anomaly Detection using Pareto Depth Analysis
Hsiao, Ko-Jen, Xu, Kevin S., Calder, Jeff, Hero, Alfred O. III
We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. Similarity-based anomaly detection algorithms detect abnormally large amounts of similarity or dissimilarity, e.g.~as measured by nearest neighbor Euclidean distances between a test sample and the training samples. In many application domains there may not exist a single dissimilarity measure that captures all possible anomalous patterns. In such cases, multiple dissimilarity measures can be defined, including non-metric measures, and one can test for anomalies by scalarizing using a non-negative linear combination of them. If the relative importance of the different dissimilarity measures are not known in advance, as in many anomaly detection applications, the anomaly detection algorithm may need to be executed multiple times with different choices of weights in the linear combination. In this paper, we propose a method for similarity-based anomaly detection using a novel multi-criteria dissimilarity measure, the Pareto depth. The proposed Pareto depth analysis (PDA) anomaly detection algorithm uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. The proposed PDA approach is provably better than using linear combinations of the criteria and shows superior performance on experiments with synthetic and real data sets.
Neyman-Pearson Classification under High-Dimensional Settings
Zhao, Anqi, Feng, Yang, Wang, Lie, Tong, Xin
Most existing binary classification methods target on the optimization of the overall classification risk and may fail to serve some real-world applications such as cancer diagnosis, where users are more concerned with the risk of misclassifying one specific class than the other. Neyman-Pearson (NP) paradigm was introduced in this context as a novel statistical framework for handling asymmetric type I/II error priorities. It seeks classifiers with a minimal type II error and a constrained type I error under a user specified level. This article is the first attempt to construct classifiers with guaranteed theoretical performance under the NP paradigm in high-dimensional settings. Based on the fundamental Neyman-Pearson Lemma, we used a plug-in approach to construct NP-type classifiers for Naive Bayes models. The proposed classifiers satisfy the NP oracle inequalities, which are natural NP paradigm counterparts of the oracle inequalities in classical binary classification. Besides their desirable theoretical properties, we also demonstrated their numerical advantages in prioritized error control via both simulation and real data studies.
Crime Prediction Based On Crime Types And Using Spatial And Temporal Criminal Hotspots
Almanie, Tahani, Mirza, Rsha, Lor, Elizabeth
This paper focuses on finding spatial and temporal criminal hotspots. It analyses two different real-world crimes datasets for Denver, CO and Los Angeles, CA and provides a comparison between the two datasets through a statistical analysis supported by several graphs. Then, it clarifies how we conducted Apriori algorithm to produce interesting frequent patterns for criminal hotspots. In addition, the paper shows how we used Decision Tree classifier and Naive Bayesian classifier in order to predict potential crime types. To further analyse crimes datasets, the paper introduces an analysis study by combining our findings of Denver crimes dataset with its demographics information in order to capture the factors that might affect the safety of neighborhoods. The results of this solution could be used to raise awareness regarding the dangerous locations and to help agencies to predict future crimes in a specific location within a particular time.
Extensions of stability selection using subsamples of observations and covariates
Beinrucker, Andre, Dogan, รrรผn, Blanchard, Gilles
Variable selection techniques aim at identifying such relevant covariates (for a review see Guyon, 2006). Usually, variable selection aims at one of two goals: to identify informative covariates in order to get scientific insight into the data and the process that generated the outcome; or to use the covariates identified as relevant in order to predict the outcome. In this work we primarily focus on the identification of informative covariates but also consider prediction results using real data. We consider variable selection (also called feature selection in computer science-related communities) as a part of the broader field of dimensionality reduction. Many variable selection methods share the common drawback of being unstable with respect to small changes of the data: if one estimates the set of relevant covariates on different sets of observations coming from the same source, the result can vary significantly.
Universal Approximation of Edge Density in Large Graphs
With the recent availability of much network data, such as world wide web, social networks, phone call networks, science collaboration graphs [1], [2], there is a renewed interest for the graph partitioning problem, especially for the automatic discovery of community structures in large networks [3], [4], [5]. Beyond clustering approaches, coclustering approaches aim at summarizing the relation between two entities in a many-to-many relationship. Such a relation can be represented as a graph, where the source and target vertices represent entities and the edges stand for relations between entities. A coclustering model provides a summary of a graph by grouping source vertices and target vertices. For example, in market analysis, the source vertices of the graph represent customers, the target vertices represent products and there is one edge each time a customer has purchased a product. A coclustering model summarizes the dataset by grouping customers that have purchased approximately the same products and grouping products that have been purchased by approximately the same customers. Coclustering models have been applied to many other domains, such as information retrieval (the entities are documents and their words in a text corpus), web log analysis (cookies and their visited web pages), web structure analysis (web pages with hyperlinks between them) or telecommunication network (the call detail records stand for the edges in a call graph between a caller and a called party). All these real-world graphs are directed multigraphs, meaning that two entities may be linked by multi-edges. We aim to summarize and discover insightful patterns in such graphs, using a method with the desired following properties: 1) Robustness, to avoid detecting spurious patterns in case of noisy data.
Identifying missing dictionary entries with frequency-conserving context models
Williams, Jake Ryland, Clark, Eric M., Bagrow, James P., Danforth, Christopher M., Dodds, Peter Sheridan
In an effort to better understand meaning from natural language texts, we explore methods aimed at organizing lexical objects into contexts. A number of these methods for organization fall into a family defined by word ordering. Unlike demographic or spatial partitions of data, these collocation models are of special importance for their universal applicability. While we are interested here in text and have framed our treatment appropriately, our work is potentially applicable to other areas of research (e.g., speech, genomics, and mobility patterns) where one has ordered categorical data, (e.g., sounds, genes, and locations). Our approach focuses on the phrase (whether word or larger) as the primary meaning-bearing lexical unit and object of study. To do so, we employ our previously developed framework for generating word-conserving phrase-frequency data. Upon training our model with the Wiktionary---an extensive, online, collaborative, and open-source dictionary that contains over 100,000 phrasal-definitions---we develop highly effective filters for the identification of meaningful, missing phrase-entries. With our predictions we then engage the editorial community of the Wiktionary and propose short lists of potential missing entries for definition, developing a breakthrough, lexical extraction technique, and expanding our knowledge of the defined English lexicon of phrases.
Implicitly Constrained Semi-Supervised Least Squares Classification
Krijthe, Jesse H., Loog, Marco
We introduce a novel semi-supervised version of the least squares classifier. This implicitly constrained least squares (ICLS) classifier minimizes the squared loss on the labeled data among the set of parameters implied by all possible labelings of the unlabeled data. Unlike other discriminative semi-supervised methods, our approach does not introduce explicit additional assumptions into the objective function, but leverages implicit assumptions already present in the choice of the supervised least squares classifier. We show this approach can be formulated as a quadratic programming problem and its solution can be found using a simple gradient descent procedure. We prove that, in a certain way, our method never leads to performance worse than the supervised classifier. Experimental results corroborate this theoretical result in the multidimensional case on benchmark datasets, also in terms of the error rate.
Kernel convolution model for decoding sounds from time-varying neural responses
Faisal, Ali, Nora, Anni, Seol, Jaeho, Renvall, Hanna, Salmelin, Riitta
In this study we present a kernel based convolution model to characterize neural responses to natural sounds by decoding their time-varying acoustic features. The model allows to decode natural sounds from high-dimensional neural recordings, such as magnetoencephalography (MEG), that track timing and location of human cortical signalling noninvasively across multiple channels. We used the MEG responses recorded from subjects listening to acoustically different environmental sounds. By decoding the stimulus frequencies from the responses, our model was able to accurately distinguish between two different sounds that it had never encountered before with 70% accuracy. Convolution models typically decode frequencies that appear at a certain time point in the sound signal by using neural responses from that time point until a certain fixed duration of the response. Using our model, we evaluated several fixed durations (time-lags) of the neural responses and observed auditory MEG responses to be most sensitive to spectral content of the sounds at time-lags of 250 ms to 500 ms. The proposed model should be useful for determining what aspects of natural sounds are represented by high-dimensional neural responses and may reveal novel properties of neural signals.