AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Regression Phalanxes

Zhang, Hongyang, Welch, William J., Zamar, Ruben H.

arXiv.org Machine LearningJul-3-2017

Tomal et al. (2015) introduced the notion of "phalanxes" in the context of rare-class detection in two-class classification problems. A phalanx is a subset of features that work well for classification tasks. In this paper, we propose a different class of phalanxes for application in regression settings. We define a "Regression Phalanx" - a subset of features that work well together for prediction. We propose a novel algorithm which automatically chooses Regression Phalanxes from high-dimensional data sets using hierarchical clustering and builds a prediction model for each phalanx for further ensembling. Through extensive simulation studies and several real-life applications in various areas (including drug discovery, chemical analysis of spectra data, microarray analysis and climate projections) we show that an ensemble of Regression Phalanxes improves prediction accuracy when combined with effective prediction methods like Lasso or Random Forests.

artificial intelligence, initial group, machine learning, (16 more...)

arXiv.org Machine Learning

1707.00727

Country: North America (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Add feedback

Multi-rank Sparse Hierarchical Clustering

Zhang, Hongyang, Zamar, Ruben H.

arXiv.org Machine LearningJul-3-2017

There has been a surge in the number of large and flat data sets - data sets containing a large number of features and a relatively small number of observations - due to the growing ability to collect and store information in medical research and other fields. Hierarchical clustering is a widely used clustering tool. In hierarchical clustering, large and flat data sets may allow for a better coverage of clustering features (features that help explain the true underlying clusters) but, such data sets usually include a large fraction of noise features (non-clustering features) that may hide the underlying clusters. Witten and Tibshirani (2010) proposed a sparse hierarchical clustering framework to cluster the observations using an adaptively chosen subset of the features, however, we show that this framework has some limitations when the data sets contain clustering features with complex structure. In this paper, we propose the Multi-rank sparse hierarchical clustering (MrSHC). We show that, using simulation studies and real data examples, MrSHC produces superior feature selection and clustering performance comparing to the classical (of-the-shelf) hierarchical clustering and the existing sparse hierarchical clustering framework.

artificial intelligence, machine learning, mrshc, (13 more...)

arXiv.org Machine Learning

1409.0745

Country: North America > Canada (0.28)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Location Dependent Dirichlet Processes

Sun, Shiliang, Paisley, John, Liu, Qiuyang

arXiv.org Machine LearningJul-2-2017

Dirichlet processes (DP) are widely applied in Bayesian nonparametric modeling. However, in their basic form they do not directly integrate dependency information among data arising from space and time. In this paper, we propose location dependent Dirichlet processes (LDDP) which incorporate nonparametric Gaussian processes in the DP modeling framework to model such dependencies. We develop the LDDP in the context of mixture modeling, and develop a mean field variational inference algorithm for this mixture model. The effectiveness of the proposed modeling framework is shown on an image segmentation task.

artificial intelligence, dirichlet process, machine learning, (16 more...)

arXiv.org Machine Learning

1707.0026

Country: Asia > China (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Build PMML-based Applications and Generate Predictions in AWS Amazon Web Services

#artificialintelligenceJun-30-2017, 19:00:32 GMT

If you generate machine learning (ML) models, you know that the key challenge is exporting and importing them into other frameworks to separate model generation and prediction. Many applications use PMML (Predictive Model Markup Language) to move ML models from one framework to another. PMML is an XML representation of a data mining model. In this post, I show how to build a PMML application on AWS. First, you build a PMML model in Apache Spark using Amazon EMR.

artificial intelligence, machine learning, pmml model, (13 more...)

#artificialintelligence

Country:

Pacific Ocean > North Pacific Ocean > San Francisco Bay (0.05)
North America > United States > California > San Francisco County > San Francisco (0.05)

Industry:

Retail > Online (0.40)
Information Technology > Services (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Fast Exact k-Means, k-Medians and Bregman Divergence Clustering in 1D

Grønlund, Allan, Larsen, Kasper Green, Mathiasen, Alexander, Nielsen, Jesper Sindahl, Schneider, Stefan, Song, Mingzhou

arXiv.org Artificial IntelligenceJun-30-2017

The $k$-Means clustering problem on $n$ points is NP-Hard for any dimension $d\ge 2$, however, for the 1D case there exist exact polynomial time algorithms. Previous literature reported an $O(kn^2)$ time dynamic programming algorithm that uses $O(kn)$ space. We present a new algorithm computing the optimal clustering in only $O(kn)$ time using linear space. For $k = \Omega(\lg n)$, we improve this even further to $n 2^{O(\sqrt{ \lg \lg n \lg k})}$ time. We generalize the new algorithm(s) to work for the absolute distance instead of squared distance and to work for any Bregman Divergence as well.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

1701.07204

Country: North America > United States (0.93)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.52)

Add feedback

Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data

Mikalsen, Karl Øyvind, Bianchi, Filippo Maria, Soguero-Ruiz, Cristina, Jenssen, Robert

arXiv.org Machine LearningJun-29-2017

Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust \emph{time series cluster kernel} (TCK). The approach taken leverages the missing data handling properties of Gaussian mixture models (GMM) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters by combining the clustering results of many GMM to form the final kernel. We evaluate the TCK on synthetic and real data and compare to other state-of-the-art techniques. The experimental results demonstrate that the TCK is robust to parameter choices, provides competitive results for MTS without missing data and outstanding results for missing data.

data mining, machine learning, pattern recognition, (20 more...)

arXiv.org Machine Learning

1704.00794

Country:

Europe (0.68)
North America > United States (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.93)
(3 more...)

Add feedback

Node Embedding via Word Embedding for Network Community Discovery

Ding, Weicong, Lin, Christy, Ishwar, Prakash

arXiv.org Machine LearningJun-28-2017

EARNING a representation for nodes in a graph, also known as node embedding, has been an important tool for extracting features that can be used in machine learning problems involving graph-structured data [1]-[4]. Perhaps the most widely adopted node embedding is the one based on the eigendecomposition of the adjacency matrix or the graph Laplacian [2], [5], [6]. Recent advances in word embeddings for natural language processing such as [7] has inspired the development of analogous embeddings for nodes in graphs [3], [8]. These so-called "neural" node embeddings have been applied to a number of supervised learning problems such us link prediction and node classification and demonstrated stateof-the-art performance [3], [4], [8]. In contrast to applications to supervised learning problems in graphs, in this work we leverage the neural embedding framework to develop an algorithm for the unsupervised community discovery problem in graphs [9]-[12]. The key idea is straightforward: learn node embeddings such that vectors of similar nodes are close to each other in the latent embedding space. Then, the problem of discovering communities in a graph can be solved by finding clusters in the embedding space. We focus on non-overlapping communities and validate the performance of the new approach through a comprehensive set of experiments on both synthetic and real-world data.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

1611.03028

Country: North America > United States (0.68)

Genre: Research Report (1.00)

Industry: Education > Focused Education > Special Education (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds

Wang, Shusen, Gittens, Alex, Mahoney, Michael W.

arXiv.org Machine LearningJun-24-2017

Kernel $k$-means clustering can correctly identify and extract a far more varied collection of cluster structures than the linear $k$-means clustering algorithm. However, kernel $k$-means clustering is computationally expensive when the non-linear feature map is high-dimensional and there are many input points. Kernel approximation, e.g., the Nystr\"om method, has been applied in previous works to approximately solve kernel learning problems when both of the above conditions are present. This work analyzes the application of this paradigm to kernel $k$-means clustering, and shows that applying the linear $k$-means clustering algorithm to $\frac{k}{\epsilon} (1 + o(1))$ features constructed using a so-called rank-restricted Nystr\"om approximation results in cluster assignments that satisfy a $1 + \epsilon$ approximation ratio in terms of the kernel $k$-means cost function, relative to the guarantee provided by the same algorithm without the use of the Nystr\"om method. As part of the analysis, this work establishes a novel $1 + \epsilon$ relative-error trace norm guarantee for low-rank approximation using the rank-restricted Nystr\"om approximation. Empirical evaluations on the $8.1$ million instance MNIST8M dataset demonstrate the scalability and usefulness of kernel $k$-means clustering with Nystr\"om approximation. This work argues that spectral clustering using Nystr\"om approximation---a popular and computationally efficient, but theoretically unsound approach to non-linear clustering---should be replaced with the efficient and theoretically sound combination of kernel $k$-means clustering with Nystr\"om approximation. The superior performance of the latter approach is empirically verified.

approximation, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1706.02803

Country: North America > United States (0.68)

Genre: Research Report > New Finding (0.92)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Query Complexity of Clustering with Side Information

Mazumdar, Arya, Saha, Barna

arXiv.org Machine LearningJun-23-2017

Suppose, we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form, "do two elements $u$ and $v$ belong to the same cluster?". The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we initiate a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and provide strong information theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. Our main contribution in this paper is to show the dramatic power of side information aka similarity matrix on reducing the query complexity of clustering. A similarity matrix represents noisy pair-wise relationships such as one computed by some function on attributes of the elements. A natural noisy model is where similarity values are drawn independently from some arbitrary probability distribution $f_+$ when the underlying pair of elements belong to the same cluster, and from some $f_-$ otherwise. We show that given such a similarity matrix, the query complexity reduces drastically from $\Theta(nk)$ (no similarity matrix) to $O(\frac{k^2\log{n}}{\cH^2(f_+\|f_-)})$ where $\cH^2$ denotes the squared Hellinger divergence. Moreover, this is also information-theoretic optimal within an $O(\log{n})$ factor. Our algorithms are all efficient, and parameter free, i.e., they work without any knowledge of $k, f_+$ and $f_-$, and only depend logarithmically with $n$. Along the way, our work also reveals intriguing connection to popular community detection models such as the {\em stochastic block model}, significantly generalizes them, and opens up many venues for interesting future research.

algorithm, probability, query, (12 more...)

arXiv.org Machine Learning

1706.07719

Country:

North America > United States > Nevada > Clark County > Las Vegas (0.05)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

The NOESIS Network-Oriented Exploration, Simulation, and Induction System

Martínez, Víctor, Berzal, Fernando, Cubero, Juan-Carlos

arXiv.org Artificial IntelligenceJun-23-2017

Data mining techniques are intended to extract information from large volumes of data (Tan et al., 2006). Data mining includes tasks such as classification, regression, clustering, or anomaly detection, among others. Traditional data mining techniques are typically applied to tabulated data. Novel techniques have also been devised for semi-structured or structured data, since exploiting the relationships among instances from a dataset leads to new research and development opportunities (Getoor and Diehl, 2005). For example, network data mining has been used to predict previously unknown protein interactions in protein-protein interaction networks (Martínez et al., 2014). It has also been used to study and predict future author collaborations and tendencies in co-authorship networks (Pavlov and Ichise, 2007). Different network mining techniques are used by popular internet search engines to rank the most relevant websites (Page et al., 1999). These are only some examples of the large number of applications of network data mining. There are many software tools that facilitate the analysis of networked data.

data mining, machine learning, node, (18 more...)

arXiv.org Artificial Intelligence

1611.0481

Country:

Europe (0.93)
North America > United States (0.46)

Genre: Research Report (0.70)

Industry: Information Technology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback