Goto

Collaborating Authors

 Clustering


Regression Phalanxes

arXiv.org Machine Learning

Tomal et al. (2015) introduced the notion of "phalanxes" in the context of rare-class detection in two-class classification problems. A phalanx is a subset of features that work well for classification tasks. In this paper, we propose a different class of phalanxes for application in regression settings. We define a "Regression Phalanx" - a subset of features that work well together for prediction. We propose a novel algorithm which automatically chooses Regression Phalanxes from high-dimensional data sets using hierarchical clustering and builds a prediction model for each phalanx for further ensembling. Through extensive simulation studies and several real-life applications in various areas (including drug discovery, chemical analysis of spectra data, microarray analysis and climate projections) we show that an ensemble of Regression Phalanxes improves prediction accuracy when combined with effective prediction methods like Lasso or Random Forests.


Multi-rank Sparse Hierarchical Clustering

arXiv.org Machine Learning

There has been a surge in the number of large and flat data sets - data sets containing a large number of features and a relatively small number of observations - due to the growing ability to collect and store information in medical research and other fields. Hierarchical clustering is a widely used clustering tool. In hierarchical clustering, large and flat data sets may allow for a better coverage of clustering features (features that help explain the true underlying clusters) but, such data sets usually include a large fraction of noise features (non-clustering features) that may hide the underlying clusters. Witten and Tibshirani (2010) proposed a sparse hierarchical clustering framework to cluster the observations using an adaptively chosen subset of the features, however, we show that this framework has some limitations when the data sets contain clustering features with complex structure. In this paper, we propose the Multi-rank sparse hierarchical clustering (MrSHC). We show that, using simulation studies and real data examples, MrSHC produces superior feature selection and clustering performance comparing to the classical (of-the-shelf) hierarchical clustering and the existing sparse hierarchical clustering framework.


Location Dependent Dirichlet Processes

arXiv.org Machine Learning

Dirichlet processes (DP) are widely applied in Bayesian nonparametric modeling. However, in their basic form they do not directly integrate dependency information among data arising from space and time. In this paper, we propose location dependent Dirichlet processes (LDDP) which incorporate nonparametric Gaussian processes in the DP modeling framework to model such dependencies. We develop the LDDP in the context of mixture modeling, and develop a mean field variational inference algorithm for this mixture model. The effectiveness of the proposed modeling framework is shown on an image segmentation task.


Build PMML-based Applications and Generate Predictions in AWS Amazon Web Services

#artificialintelligence

If you generate machine learning (ML) models, you know that the key challenge is exporting and importing them into other frameworks to separate model generation and prediction. Many applications use PMML (Predictive Model Markup Language) to move ML models from one framework to another. PMML is an XML representation of a data mining model. In this post, I show how to build a PMML application on AWS. First, you build a PMML model in Apache Spark using Amazon EMR.


Fast Exact k-Means, k-Medians and Bregman Divergence Clustering in 1D

arXiv.org Artificial Intelligence

The $k$-Means clustering problem on $n$ points is NP-Hard for any dimension $d\ge 2$, however, for the 1D case there exist exact polynomial time algorithms. Previous literature reported an $O(kn^2)$ time dynamic programming algorithm that uses $O(kn)$ space. We present a new algorithm computing the optimal clustering in only $O(kn)$ time using linear space. For $k = \Omega(\lg n)$, we improve this even further to $n 2^{O(\sqrt{ \lg \lg n \lg k})}$ time. We generalize the new algorithm(s) to work for the absolute distance instead of squared distance and to work for any Bregman Divergence as well.


Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data

arXiv.org Machine Learning

Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust \emph{time series cluster kernel} (TCK). The approach taken leverages the missing data handling properties of Gaussian mixture models (GMM) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters by combining the clustering results of many GMM to form the final kernel. We evaluate the TCK on synthetic and real data and compare to other state-of-the-art techniques. The experimental results demonstrate that the TCK is robust to parameter choices, provides competitive results for MTS without missing data and outstanding results for missing data.


Node Embedding via Word Embedding for Network Community Discovery

arXiv.org Machine Learning

EARNING a representation for nodes in a graph, also known as node embedding, has been an important tool for extracting features that can be used in machine learning problems involving graph-structured data [1]-[4]. Perhaps the most widely adopted node embedding is the one based on the eigendecomposition of the adjacency matrix or the graph Laplacian [2], [5], [6]. Recent advances in word embeddings for natural language processing such as [7] has inspired the development of analogous embeddings for nodes in graphs [3], [8]. These so-called "neural" node embeddings have been applied to a number of supervised learning problems such us link prediction and node classification and demonstrated stateof-the-art performance [3], [4], [8]. In contrast to applications to supervised learning problems in graphs, in this work we leverage the neural embedding framework to develop an algorithm for the unsupervised community discovery problem in graphs [9]-[12]. The key idea is straightforward: learn node embeddings such that vectors of similar nodes are close to each other in the latent embedding space. Then, the problem of discovering communities in a graph can be solved by finding clusters in the embedding space. We focus on non-overlapping communities and validate the performance of the new approach through a comprehensive set of experiments on both synthetic and real-world data.


Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds

arXiv.org Machine Learning

Kernel $k$-means clustering can correctly identify and extract a far more varied collection of cluster structures than the linear $k$-means clustering algorithm. However, kernel $k$-means clustering is computationally expensive when the non-linear feature map is high-dimensional and there are many input points. Kernel approximation, e.g., the Nystr\"om method, has been applied in previous works to approximately solve kernel learning problems when both of the above conditions are present. This work analyzes the application of this paradigm to kernel $k$-means clustering, and shows that applying the linear $k$-means clustering algorithm to $\frac{k}{\epsilon} (1 + o(1))$ features constructed using a so-called rank-restricted Nystr\"om approximation results in cluster assignments that satisfy a $1 + \epsilon$ approximation ratio in terms of the kernel $k$-means cost function, relative to the guarantee provided by the same algorithm without the use of the Nystr\"om method. As part of the analysis, this work establishes a novel $1 + \epsilon$ relative-error trace norm guarantee for low-rank approximation using the rank-restricted Nystr\"om approximation. Empirical evaluations on the $8.1$ million instance MNIST8M dataset demonstrate the scalability and usefulness of kernel $k$-means clustering with Nystr\"om approximation. This work argues that spectral clustering using Nystr\"om approximation---a popular and computationally efficient, but theoretically unsound approach to non-linear clustering---should be replaced with the efficient and theoretically sound combination of kernel $k$-means clustering with Nystr\"om approximation. The superior performance of the latter approach is empirically verified.


Query Complexity of Clustering with Side Information

arXiv.org Machine Learning

Suppose, we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form, "do two elements $u$ and $v$ belong to the same cluster?". The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we initiate a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and provide strong information theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. Our main contribution in this paper is to show the dramatic power of side information aka similarity matrix on reducing the query complexity of clustering. A similarity matrix represents noisy pair-wise relationships such as one computed by some function on attributes of the elements. A natural noisy model is where similarity values are drawn independently from some arbitrary probability distribution $f_+$ when the underlying pair of elements belong to the same cluster, and from some $f_-$ otherwise. We show that given such a similarity matrix, the query complexity reduces drastically from $\Theta(nk)$ (no similarity matrix) to $O(\frac{k^2\log{n}}{\cH^2(f_+\|f_-)})$ where $\cH^2$ denotes the squared Hellinger divergence. Moreover, this is also information-theoretic optimal within an $O(\log{n})$ factor. Our algorithms are all efficient, and parameter free, i.e., they work without any knowledge of $k, f_+$ and $f_-$, and only depend logarithmically with $n$. Along the way, our work also reveals intriguing connection to popular community detection models such as the {\em stochastic block model}, significantly generalizes them, and opens up many venues for interesting future research.


The NOESIS Network-Oriented Exploration, Simulation, and Induction System

arXiv.org Artificial Intelligence

Data mining techniques are intended to extract information from large volumes of data (Tan et al., 2006). Data mining includes tasks such as classification, regression, clustering, or anomaly detection, among others. Traditional data mining techniques are typically applied to tabulated data. Novel techniques have also been devised for semi-structured or structured data, since exploiting the relationships among instances from a dataset leads to new research and development opportunities (Getoor and Diehl, 2005). For example, network data mining has been used to predict previously unknown protein interactions in protein-protein interaction networks (Martínez et al., 2014). It has also been used to study and predict future author collaborations and tendencies in co-authorship networks (Pavlov and Ichise, 2007). Different network mining techniques are used by popular internet search engines to rank the most relevant websites (Page et al., 1999). These are only some examples of the large number of applications of network data mining. There are many software tools that facilitate the analysis of networked data.