 Clustering


Thwarting DoS Attacks: A Framework for Detection based on Collective Anomalies and Clustering

IEEE Computer

A hybrid learning framework uses a collective anomaly to analyze patterns in denial-of-service attacks along with data clustering to distinguish an attack from normal network traffic. In two evaluation datasets, the framework achieved higher hit rates relative to existing anomaly-detection techniques. Mohiuddin Ahmed, "Thwarting DoS Attacks: A Framework for Detection based on Collective Anomalies and Clustering", Computer, vol.


Statistics Is Easy

@machinelearnbot

With today's software, statistics is easy, right? Even before the start of Data Mania, circa 2010, vendors have been suggesting that if we buy their easy-to-use statistical software, we don't really need to know what we're doing. Since then, hogwash about automated machine learning and "AI" has populated the blogosphere in great quantity. What should populate the blogosphere instead are the true horror stories about costly errors people with little background in statistics are making with this easy-to-use software. Over time, they may help, but typically these programs and courses cover a wide range of subjects superficially.


A Compressive Sensing Approach to Community Detection with Applications

arXiv.org Machine Learning

The community detection problem for graphs asks one to partition the n vertices V of a graph G into k communities, or clusters, such that there are many intracluster edges and few intercluster edges. Of course this is equivalent to finding a permutation matrix P such that, if A denotes the adjacency matrix of G, then PAP^T is approximately block diagonal. As there are k^n possible partitions of n vertices into k subsets, directly determining the optimal clustering is clearly infeasible. Instead one seeks to solve a more tractable approximation to the clustering problem. In this paper we reformulate the community detection problem via sparse solution of a linear system associated with the Laplacian of a graph G and then develop a two-stage approach based on a thresholding technique and a compressive sensing algorithm to find a sparse solution which corresponds to the community containing a vertex of interest in G. Crucially, our approach results in an algorithm which is able to find a single cluster of size n_0 in O(nlog(n)n_0) operations and all k clusters in fewer than O(n^2ln(n)) operations. This is a marked improvement over the classic spectral clustering algorithm, which is unable to find a single cluster at a time and takes approximately O(n^3) operations to find all k clusters. Moreover, we are able to provide robust guarantees of success for the case where G is drawn at random from the Stochastic Block Model, a popular model for graphs with clusters. Extensive numerical results are also provided, showing the efficacy of our algorithm on both synthetic and real-world data sets.
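To make the block-diagonal formulation above concrete, here is a minimal sketch in Python with NumPy. It assumes the community labels are already known, which is exactly the part the paper's algorithm has to discover; the helper name `permute_by_labels` and the toy graph are illustrative and do not come from the paper.

```python
import numpy as np

def permute_by_labels(A, labels):
    """Return (P A P^T, P), where P reorders vertices so that
    members of the same community are adjacent."""
    order = np.argsort(labels, kind="stable")
    # Row i of P is the standard basis vector e_{order[i]}.
    P = np.eye(len(labels), dtype=A.dtype)[order]
    return P @ A @ P.T, P

# Toy graph: triangles {0, 2, 4} and {1, 3, 5}, joined by the single edge (0, 1).
edges = [(0, 2), (0, 4), (2, 4), (1, 3), (1, 5), (3, 5), (0, 1)]
A = np.zeros((6, 6), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

labels = np.array([0, 1, 0, 1, 0, 1])
B, P = permute_by_labels(A, labels)

# Entries outside the two diagonal blocks count intercluster edges.
off_block_edges = int(B[:3, 3:].sum())
print(B)
print("intercluster edges:", off_block_edges)
```

With a good labeling, almost all nonzero entries of the permuted matrix fall inside the diagonal blocks, which is what "approximately block diagonal" means here.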


A Dirichlet Mixture Model of Hawkes Processes for Event Sequence Clustering

arXiv.org Machine Learning

We propose an effective method to solve the event sequence clustering problem based on a novel Dirichlet mixture model of a special but significant type of point process, the Hawkes process. In this model, each event sequence belonging to a cluster is generated via the same Hawkes process with specific parameters, and different clusters correspond to different Hawkes processes. The prior distribution of the Hawkes processes is controlled via a Dirichlet distribution. We learn the model via a maximum likelihood estimator (MLE) and propose an effective variational Bayesian inference algorithm. We specifically analyze the resulting EM-type algorithm in the context of inner-outer iterations and discuss several inner iteration allocation strategies. The identifiability of our model, the convergence of our learning method, and its sample complexity are analyzed both theoretically and empirically, demonstrating the superiority of our method over competing approaches. The proposed method learns the number of clusters automatically and is robust to model misspecification. Experiments on both synthetic and real-world data show that our method can learn diverse triggering patterns hidden in asynchronous event sequences and achieve encouraging performance on clustering purity and consistency.
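The idea that each cluster corresponds to its own Hawkes process can be illustrated with a small sketch. It assumes an exponential triggering kernel with parameters `mu`, `alpha`, and `beta` (the abstract does not specify a kernel), and it assigns a sequence to the cluster with the highest log-likelihood, which is a hard-assignment simplification rather than the variational Bayesian algorithm the authors propose. All function names and parameter values are hypothetical.

```python
import math

def hawkes_intensity(t, history, mu, alpha, beta):
    """Intensity at time t: base rate mu plus exponentially decaying
    boosts from every event strictly before t (assumed kernel)."""
    return mu + sum(alpha * math.exp(-beta * (t - s)) for s in history if s < t)

def hawkes_loglik(events, T, mu, alpha, beta):
    """Log-likelihood of sorted event times observed on [0, T]."""
    log_term = sum(
        math.log(hawkes_intensity(t, events, mu, alpha, beta)) for t in events
    )
    # Closed-form integral of the intensity over [0, T].
    compensator = mu * T + sum(
        (alpha / beta) * (1.0 - math.exp(-beta * (T - s))) for s in events
    )
    return log_term - compensator

def assign_cluster(events, T, cluster_params):
    """Index of the cluster whose Hawkes process best explains the sequence."""
    scores = [hawkes_loglik(events, T, *params) for params in cluster_params]
    return max(range(len(scores)), key=scores.__getitem__)

# Two hypothetical clusters, each given as (mu, alpha, beta).
clusters = [(0.1, 0.5, 1.0), (1.0, 0.5, 1.0)]

# A sequence with no events on [0, 10] is better explained by the low base rate.
quiet_sequence = []
label = assign_cluster(quiet_sequence, 10.0, clusters)
print("assigned cluster:", label)
```

A full mixture model would replace the hard `assign_cluster` step with soft responsibilities weighted by the mixing proportions.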


Learning Mixtures of Multi-Output Regression Models by Correlation Clustering for Multi-View Data

arXiv.org Machine Learning

In many datasets, different parts of the data may have their own patterns of correlation, a structure that can be modeled as a mixture of local linear correlation models. The task of finding these mixtures is known as correlation clustering. In this work, we propose a linear correlation clustering method for datasets whose features are pre-divided into two views. The method, called Canonical Least Squares (CLS) clustering, is inspired by multi-output regression and Canonical Correlation Analysis. CLS clusters can be interpreted as variations in the regression relationship between the two views. The method is useful for data mining and data interpretation. Its utility is demonstrated on a synthetic dataset and a stock market dataset.


Statistical inference on random dot product graphs: a survey

arXiv.org Machine Learning

The random dot product graph (RDPG) is an independent-edge random graph that is analytically tractable and, simultaneously, either encompasses or can successfully approximate a wide range of random graphs, from relatively simple stochastic block models to complex latent position graphs. In this survey paper, we describe a comprehensive paradigm for statistical inference on random dot product graphs, a paradigm centered on spectral embeddings of adjacency and Laplacian matrices. We examine the analogues, in graph inference, of several canonical tenets of classical Euclidean inference: in particular, we summarize a body of existing results on the consistency and asymptotic normality of the adjacency and Laplacian spectral embeddings, and the role these spectral embeddings can play in the construction of single- and multi-sample hypothesis tests for graph data. We investigate several real-world applications, including community detection and classification in large social networks and the determination of functional and biologically relevant network properties from an exploratory data analysis of the Drosophila connectome. We outline requisite background and current open problems in spectral graph inference.
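As a rough illustration of the spectral embeddings this survey is centered on, the sketch below computes one common form of the adjacency spectral embedding: the top `d` eigenvectors of a symmetric matrix, scaled by the square roots of the corresponding eigenvalue magnitudes. The function name and the latent positions are assumptions made for the example, and the embedding is applied to the edge-probability matrix itself rather than to a sampled graph, so that the result is exact up to rotation.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Embed the n vertices of a symmetric matrix A into R^d."""
    vals, vecs = np.linalg.eigh(A)
    # Keep the d eigenvalues of largest magnitude.
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

# Hypothetical latent positions: two vertices per community.
X = np.array([[0.8, 0.2],
              [0.8, 0.2],
              [0.2, 0.7],
              [0.2, 0.7]])
P = X @ X.T  # edge probabilities of a random dot product graph

X_hat = adjacency_spectral_embedding(P, d=2)

# Latent positions are only identifiable up to an orthogonal transform,
# so compare the inner products rather than the coordinates.
print(np.allclose(X_hat @ X_hat.T, P))
```

On a sampled adjacency matrix the recovered inner products would only approximate `P`, which is where the consistency and asymptotic normality results summarized in the survey come in.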


Co-Clustering Can Provide Industrial Data Pattern Discovery

#artificialintelligence

In spite of the rapid development in data acquisition technology resulting in the explosive collection of acquired datasets, techniques such as data organization and classification, manipulation, and analysis of very large, diverse, heterogeneous datasets have only evolved modestly. This has hindered the effective use and understanding of the acquired large-scale data for knowledge discovery. In an industrial setting, an interesting visual from McKinsey illustrates that despite collecting data from tens of thousands of sensors, less than 1% is actually utilized. Data clustering is the classification of data objects into different groups (clusters) such that data objects in one group are similar to one another and dissimilar from those in other groups. Typically, homogeneous data objects, i.e. data objects having the same data type, are grouped together using some of the well-known clustering algorithms.


Subspace Clustering using Ensembles of $K$-Subspaces

arXiv.org Machine Learning

In modern computer vision problems such as facial recognition [1] and object tracking [2], researchers have found success applying the union of subspaces (UoS) model, in which data vectors lie near one of several subspaces. Under this model, the goal is to simultaneously identify these underlying subspaces and cluster the points according to their nearest subspace. Algorithms designed to solve this problem fall under the category of subspace clustering, a topic that has received a great deal of attention in recent years [3] due to its efficacy on real-world datasets such as the Extended Yale Face Database B [4] and the MNIST handwritten digit database [5]. One of the earliest approaches to solving the subspace clustering problem involves an iterative method in the spirit of K-means, known as K-subspaces (KSS) [6], [7], [8], which alternates between assigning points to clusters and estimating the subspace basis associated with each cluster. As this algorithm is only guaranteed to converge to a local minimum, in practice one runs many instances of the algorithm and chooses the final clustering as the one that produces the minimum cost. Although its empirical performance is limited, KSS continues to serve as a benchmark for subspace clustering algorithms, in part due to its computational efficiency and simplicity. Therefore, a deeper understanding of this method is an important contribution in the area of subspace clustering and a contribution of this paper. While the KSS cost function and alternating algorithm are perhaps the most natural approach for the subspace clustering problem, it is known that there is a set of initializations of nonzero measure from which the algorithm will converge to a point other than the global minimizer.
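The alternating scheme described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration of plain K-subspaces with random restarts, not the ensemble method the paper proposes; the function name, the random initialization, and the toy data are assumptions made for the example.

```python
import numpy as np

def k_subspaces(X, k, dim, n_iter=20, seed=0):
    """Cluster the rows of X into k subspaces of dimension dim.
    Returns (labels, bases, cost), where cost is the sum of squared
    distances from each point to its assigned subspace."""
    rng = np.random.default_rng(seed)
    ambient = X.shape[1]
    # Random orthonormal bases, one (ambient x dim) matrix per cluster.
    bases = [np.linalg.qr(rng.standard_normal((ambient, dim)))[0] for _ in range(k)]

    def residuals(current_bases):
        # Distance from every point to every subspace, shape (n_points, k).
        return np.stack(
            [np.linalg.norm(X - X @ U @ U.T, axis=1) for U in current_bases], axis=1
        )

    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest subspace.
        labels = residuals(bases).argmin(axis=1)
        # Estimation step: refit each basis from its assigned points.
        for c in range(k):
            pts = X[labels == c]
            if len(pts) >= dim:
                _, _, Vt = np.linalg.svd(pts, full_matrices=False)
                bases[c] = Vt[:dim].T  # top right singular vectors

    final = residuals(bases)
    labels = final.argmin(axis=1)
    cost = float((final.min(axis=1) ** 2).sum())
    return labels, bases, cost

# Toy data: four points on the x-axis, four on the y-axis (two lines in R^2).
X = np.array([[1, 0], [-1, 0], [2, 0], [-2, 0],
              [0, 1], [0, -1], [0, 2], [0, -2]], dtype=float)

# Several random restarts; keep the run with the lowest cost.
runs = [k_subspaces(X, k=2, dim=1, seed=s) for s in range(5)]
labels, bases, cost = min(runs, key=lambda run: run[2])
print(labels, cost)
```

Keeping the lowest-cost run mirrors the restart strategy mentioned in the abstract; on harder data, individual runs can stall at different local minima.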


A relevance-scalability-interpretability tradeoff with temporally evolving user personas

arXiv.org Machine Learning

The current work characterizes the users of a VoD streaming space through user personas based on a tenure timeline and temporal behavioral features in the absence of explicit user profiles. A combination of tenure timeline and temporal characteristics caters to the business need of understanding the evolution and phases of user behavior as accounts age. The personas constructed in this work successfully represent both dominant and niche characterizations while providing insight into the maturation of user behavior in the system. The two major highlights of our personas are stability along tenure timelines at the population level, with interesting migrations between labels at the individual level, and clear interpretability of user labels. Finally, we show a trade-off among an indispensable trio of guarantees (relevance, scalability, and interpretability) by using summary information from personas in a click-through rate (CTR) predictive model. The proposed method of uncovering latent personas, the insights derived from them, and the application of persona information to predictive models are broadly applicable to other streaming-based products.


Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data

arXiv.org Machine Learning

Flow cytometry is a high-throughput technology used to quantify multiple surface and intracellular markers at the level of a single cell. This makes it possible to identify cell subtypes and to determine their relative proportions. Improvements in this technology make it possible to describe millions of individual cells from a blood sample using multiple markers. This results in high-dimensional datasets, whose manual analysis is highly time-consuming and poorly reproducible. While several methods have been developed to perform automatic recognition of cell populations, most of them treat and analyze each sample independently. However, in practice, individual samples are rarely independent (e.g. longitudinal studies). Here, we propose to use a Bayesian nonparametric approach with a Dirichlet process mixture (DPM) of multivariate skew $t$-distributions to perform model-based clustering of flow cytometry data. DPM models directly estimate the number of cell populations from the data, avoiding model selection issues, and skew $t$-distributions provide robustness to outliers and to the non-elliptical shape of cell populations. To accommodate repeated measurements, we propose a sequential strategy relying on a parametric approximation of the posterior. We illustrate the good performance of our method on simulated data, on an experimental benchmark dataset, and on new longitudinal data from the DALIA-1 trial, which evaluates a therapeutic vaccine against HIV. On the benchmark dataset, the sequential strategy outperforms all other methods evaluated, and it similarly leads to improved performance on the DALIA-1 data. We have made the method available to the community in the R package NPflow.