AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

A Survey on Multi-View Clustering

Chao, Guoqing, Sun, Shiliang, Bi, Jinbo

arXiv.org Machine LearningDec-17-2017

Clustering [1] is a paradigm to classify the subjects into several groups based on their similarity information. As we know that clustering is a fundamental task in machine learning, pattern recognition and data mining fields and it has widespread applications. With the obtained groups by clustering methods, further analysis tasks can be conducted to achieve different ultimate goals. However, traditional clustering methods only use one feature set or one view information of the subjects while multiple feature sets or multiple view information of these subjects are available. The subjects of interest with multiple feature sets or multiple view information are the so called multi-view data. Multi-view data are very common in real-world applications due to the innate properties, or collecting from different sources. For instance, a web page can be described by the words appearing on the web page itself and the words underlying all links pointing to the web page from other pages in nature. In multimedia content understanding, the multimedia segments can be simultaneously described by their video signals from visual camera and audio signals from voice recorder devices. The existence of such multi-view data raised the interest of multi-view learning [2], [3], [4], which has been extensively studied in semi-supervised setting.

artificial intelligence, machine learning, proceedings, (14 more...)

arXiv.org Machine Learning

1712.06246

Country: North America > United States > Connecticut (0.28)

Genre: Overview (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.46)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Taming Wild High Dimensional Text Data with a Fuzzy Lash

Karami, Amir

arXiv.org Machine LearningDec-16-2017

The bag of words (BOW) represents a corpus in a matrix whose elements are the frequency of words. However, each row in the matrix is a very high-dimensional sparse vector. Dimension reduction (DR) is a popular method to address sparsity and high-dimensionality issues. Among different strategies to develop DR method, Unsupervised Feature Transformation (UFT) is a popular strategy to map all words on a new basis to represent BOW. The recent increase of text data and its challenges imply that DR area still needs new perspectives. Although a wide range of methods based on the UFT strategy has been developed, the fuzzy approach has not been considered for DR based on this strategy. This research investigates the application of fuzzy clustering as a DR method based on the UFT strategy to collapse BOW matrix to provide a lower-dimensional representation of documents instead of the words in a corpus. The quantitative evaluation shows that fuzzy clustering produces superior performance and features to Principal Components Analysis (PCA) and Singular Value Decomposition (SVD), two popular DR methods based on the UFT strategy.

artificial intelligence, fuzzy logic, machine learning, (16 more...)

arXiv.org Machine Learning

1712.05997

Country:

North America > United States > South Carolina (0.28)
North America > United States > Maryland (0.28)

Genre: Research Report (0.40)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A Novel Bayesian Cluster Enumeration Criterion for Unsupervised Learning

Teklehaymanot, Freweyni K., Muma, Michael, Zoubir, Abdelhak M.

arXiv.org Machine LearningDec-15-2017

We derive a new Bayesian Information Criterion (BIC) from first principles by formulating the problem of estimating the number of clusters in an observed data set as maximization of the posterior probability of the candidate models. Given that some mild assumptions are satisfied, we provide a general BIC expression for a broad class of data distributions. This serves as an important milestone when deriving the BIC for specific data distributions. Along this line, we provide a closed-form BIC expression for multivariate Gaussian distributed observations. We show that incorporating data structure of the clustering problem into the derivation of the BIC results in an expression whose penalty term is different from that of the original BIC. We propose a two-step cluster enumeration algorithm. First, a model-based unsupervised learning algorithm partitions the data according to a given set of candidate models. Subsequently, the optimal cluster number is determined as the one associated to the model for which the proposed BIC is maximal. The performance of the proposed criterion is tested using synthetic and real data sets. Despite the fact that the original BIC is a generic criterion which does not include information about the specific model selection problem at hand, it has been widely used in the literature to estimate the number of clusters in an observed data set. We, therefore, consider it as a benchmark comparison. Simulation results show that our proposed criterion outperforms the existing cluster enumeration methods that are based on the original BIC.

artificial intelligence, machine learning, penalty term, (17 more...)

arXiv.org Machine Learning

1710.07954

Country: Europe (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)

Add feedback

Wade , Ghahramani : Bayesian Cluster Analysis: Point Estimation and Credible Balls

#artificialintelligenceDec-14-2017, 23:10:05 GMT

Clustering is widely studied in statistics and machine learning, with applications in a variety of fields. As opposed to popular algorithms such as agglomerative hierarchical clustering or k-means which return a single clustering solution, Bayesian nonparametric models provide a posterior over the entire space of partitions, allowing one to assess statistical properties, such as uncertainty on the number of clusters. However, an important problem is how to summarize the posterior; the huge dimension of partition space and difficulties in visualizing it add to this problem. In a Bayesian analysis, the posterior of a real-valued parameter of interest is often summarized by reporting a point estimate such as the posterior mean along with 95% credible intervals to characterize uncertainty. In this paper, we extend these ideas to develop appropriate point estimates and credible sets to summarize the posterior of the clustering structure based on decision and information theoretic techniques.

artificial intelligence, machine learning, point estimation and credible ball, (6 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Add feedback

ClustGeo: an R package for hierarchical clustering with spatial constraints

Chavent, Marie, Kuentz-Simonet, Vanessa, Labenne, Amaury, Saracco, Jérôme

arXiv.org Machine LearningDec-13-2017

In this paper, we propose a Ward-like hierarchical clustering algorithm including spatial/geographical constraints. Two dissimilarity matrices $D_0$ and $D_1$ are inputted, along with a mixing parameter $\alpha \in [0,1]$. The dissimilarities can be non-Euclidean and the weights of the observations can be non-uniform. The first matrix gives the dissimilarities in the "feature space" and the second matrix gives the dissimilarities in the "constraint space". The criterion minimized at each stage is a convex combination of the homogeneity criterion calculated with $D_0$ and the homogeneity criterion calculated with $D_1$. The idea is then to determine a value of $\alpha$ which increases the spatial contiguity without deteriorating too much the quality of the solution based on the variables of interest i.e. those of the feature space. This procedure is illustrated on a real dataset using the R package ClustGeo.

artificial intelligence, machine learning, partition, (19 more...)

arXiv.org Machine Learning

doi: 10.1007/s00180-018-0791-1

1707.03897

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A Quantum Extension of Variational Bayes Inference

Miyahara, Hideyuki, Sughiyama, Yuki

arXiv.org Machine LearningDec-13-2017

Institute of Industrial Science, The University of Tokyo, 4-6-1, Komaba, Meguro-ku, Tokyo 153-8505, Japan (Dated: December 14, 2017) Variational Bayes (VB) inference is one of the most important algorithms in machine learning and widely used in engineering and industry. However, VB is known to suffer from the problem of local optima. In this Letter, we generalize VB by using quantum mechanics, and propose a new algorithm, which we call quantum annealing variational Bayes (QA VB) inference. We then show that QA VB drastically improve the performance of VB by applying them to a clustering problem described by a Gaussian mixture model. Finally, we discuss an intuitive understanding on how QA VB works well.

artificial intelligence, machine learning, qa vb, (15 more...)

arXiv.org Machine Learning

1712.04709

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.45)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Incremental Eigenpair Computation for Graph Laplacian Matrices: Theory and Applications

Chen, Pin-Yu, Zhang, Baichuan, Hasan, Mohammad Al

arXiv.org Machine LearningDec-13-2017

The smallest eigenvalues and the associated eigenvectors (i.e., eigenpairs) of a graph Laplacian matrix have been widely used in spectral clustering and community detection. However, in real-life applications the number of clusters or communities (say, $K$) is generally unknown a-priori. Consequently, the majority of the existing methods either choose $K$ heuristically or they repeat the clustering method with different choices of $K$ and accept the best clustering result. The first option, more often, yields suboptimal result, while the second option is computationally expensive. In this work, we propose an incremental method for constructing the eigenspectrum of the graph Laplacian matrix. This method leverages the eigenstructure of graph Laplacian matrix to obtain the $K$-th smallest eigenpair of the Laplacian matrix given a collection of all previously computed $K-1$ smallest eigenpairs. Our proposed method adapts the Laplacian matrix such that the batch eigenvalue decomposition problem transforms into an efficient sequential leading eigenpair computation problem. As a practical application, we consider user-guided spectral clustering. Specifically, we demonstrate that users can utilize the proposed incremental method for effective eigenpair computation and for determining the desired number of clusters based on multiple clustering metrics.

data mining, eigenpair, machine learning, (15 more...)

arXiv.org Machine Learning

1801.08196

Country: North America > United States (0.94)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area (0.46)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Oversampling for Imbalanced Learning Based on K-Means and SMOTE

Last, Felix, Douzas, Georgios, Bacao, Fernando

arXiv.org Machine LearningDec-12-2017

Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the python programming language.

artificial intelligence, machine learning, smote, (18 more...)

arXiv.org Machine Learning

1711.00837

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

Outlier Detection by Consistent Data Selection Method

Porwal, Utkarsh, Mukund, Smruthi

arXiv.org Machine LearningDec-11-2017

Often the challenge associated with tasks like fraud and spam detection[1] is the lack of all likely patterns needed to train suitable supervised learning models. In order to overcome this limitation, such tasks are attempted as outlier or anomaly detection tasks. We also hypothesize that out- liers have behavioral patterns that change over time. Limited data and continuously changing patterns makes learning significantly difficult. In this work we are proposing an approach that detects outliers in large data sets by relying on data points that are consistent. The primary contribution of this work is that it will quickly help retrieve samples for both consistent and non-outlier data sets and is also mindful of new outlier patterns. No prior knowledge of each set is required to extract the samples. The method consists of two phases, in the first phase, consistent data points (non- outliers) are retrieved by an ensemble method of unsupervised clustering techniques and in the second phase a one class classifier trained on the consistent data point set is ap- plied on the remaining sample set to identify the outliers. The approach is tested on three publicly available data sets and the performance scores are competitive.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1712.04129

Country: North America > United States (0.68)

Genre: Research Report (0.50)

Industry:

Information Technology (0.47)
Law Enforcement & Public Safety (0.47)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Improving Malware Detection Accuracy by Extracting Icon Information

Silva, Pedro, Akhavan-Masouleh, Sepehr, Li, Li

arXiv.org Machine LearningDec-10-2017

Detecting PE malware files is now commonly approached using statistical and machine learning models. While these models commonly use features extracted from the structure of PE files, we propose that icons from these files can also help better predict malware. We propose an innovative machine learning approach to extract information from icons. Our proposed approach consists of two steps: 1) extracting icon features using summary statics, histogram of gradients (HOG), and a convolutional autoencoder, 2) clustering icons based on the extracted icon features. Using publicly available data and by using machine learning experiments, we show our proposed icon clusters significantly boost the efficacy of malware prediction models. In particular, our experiments show an average accuracy increase of 10% when icon clusters are used in the prediction model.

artificial intelligence, information, machine learning, (15 more...)

arXiv.org Machine Learning

1712.03483

Country: North America > United States > California (0.28)

Genre: Research Report > New Finding (0.47)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Add feedback