AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

MSD-Kmeans: A Novel Algorithm for Efficient Detection of Global and Local Outliers

Wei, Yuanyuan, Jang-Jaccard, Julian, Sabrina, Fariza, McIntosh, Timothy

arXiv.org Machine LearningOct-15-2019

Outlier detection is a technique in data mining that aims to detect unusual or unexpected records in the dataset. Existing outlier detection algorithms have different pros and cons and exhibit different sensitivity to noisy data such as extreme values. In this paper, we propose a novel cluster-based outlier detection algorithm named MSD-Kmeans that combines the statistical method of Mean and Standard Deviation (MSD) and the machine learning clustering algorithm K-means to detect outliers more accurately with the better control of extreme values. There are two phases in this combination method of MSD-Kmeans: (1) applying MSD algorithm to eliminate as many noisy data to minimize the interference on clusters, and (2) applying K-means algorithm to obtain local optimal clusters. We evaluate our algorithm and demonstrate its effectiveness in the context of detecting possible overcharging of taxi fares, as greedy dishonest drivers may attempt to charge high fares by detouring. We compare the performance indicators of MSD-Kmeans with those of other outlier detection algorithms, such as MSD, K-means, Z-score, MIQR and LOF, and prove that the proposed MSD-Kmeans algorithm achieves the highest measure of precision, accuracy, and F-measure. We conclude that MSD-Kmeans can be used for effective and efficient outlier detection on data of varying quality on IoT devices.

algorithm, msd-kmean, outlier, (14 more...)

arXiv.org Machine Learning

1910.06588

Country:

Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
Oceania > Australia > Queensland (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia (0.04)

Genre: Research Report (0.64)

Industry:

Transportation > Ground > Road (0.90)
Transportation > Passenger (0.68)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

MIM: Mutual Information Machine

Livne, Micha, Swersky, Kevin, Fleet, David J.

arXiv.org Machine LearningOct-14-2019

We introduce the Mutual Information Machine (MIM), an autoencoder model for learning joint distributions over observations and latent states. The model formulation reflects two key design principles: 1) symmetry, to encourage the encoder and decoder to learn consistent factorizations of the same underlying distribution; and 2) mutual information, to encourage the learning of useful representations for downstream tasks. The objective comprises the Jensen-Shannon divergence between the encoding and decoding joint distributions, plus a mutual information term. We show that this objective can be bounded by a tractable cross-entropy loss between the true model and a parameterized approximation, and relate this to maximum likelihood estimation and variational autoencoders. Experiments show that MIM is capable of learning a latent representation with high mutual information, and good unsupervised clustering, while providing data log likelihood comparable to VAE (with a sufficiently expressive architecture).

artificial intelligence, bayesian inference, machine learning, (19 more...)

arXiv.org Machine Learning

1910.03175

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Supervised Encoding for Discrete Representation Learning

Le, Cat P., Zhou, Yi, Ding, Jie, Tarokh, Vahid

arXiv.org Machine LearningOct-14-2019

Classical supervised classification tasks search for a nonlinear mapping that maps each encoded feature directly to a probability mass over the labels. Such a learning framework typically lacks the intuition that encoded features from the same class tend to be similar and thus has little interpretability for the learned features. In this paper, we propose a novel supervised learning model named Supervised-Encoding Quantizer (SEQ). The SEQ applies a quantizer to cluster and classify the encoded features. We found that the quantizer provides an interpretable graph where each cluster in the graph represents a class of data samples that have a particular style. We also trained a decoder that can decode convex combinations of the encoded features from similar and different clusters and provide guidance on style transfer between sub-classes.

data sample, encoder, quantizer, (16 more...)

arXiv.org Machine Learning

1910.11067

Country:

North America > United States > Utah (0.04)
North America > United States > Minnesota (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Add feedback

DISCERN: Diversity-based Selection of Centroids for k-Estimation and Rapid Non-stochastic Clustering

Hassani, Ali, Iranmanesh, Amir, Eftekhari, Mahdi, Salemi, Abbas

arXiv.org Machine LearningOct-14-2019

As one of the most ubiquitously applied unsupervised learning methods, clustering has also been known to have a few disadvantages. More specifically, parameters such as the number of clusters and neighborhood radius are what call the `unsupervised` nature of these algorithms into question. Moreover, the stochastic nature of a great number of these algorithms is also a considerable point of weakness. In order to address these issues, we propose DISCERN which can serve as an initialization algorithm for K-Means, finding suitable centroids that increase the performance of K-Means. Following that, the algorithm can estimate the number of clusters if need be. The algorithm does all of that, while maintaining complete robustness and returning the same results at each separate run. We ran experiments on the proposed method processing multiple datasets and the results show its undeniable superiority in terms of results, computational time and robustness when compared to the randomized K-Means and K-Means++ initialization. In addition, the superiority in estimating the number of clusters is also discussed and we prove the lower complexity when compared to methods such as the elbow and silhouette methods in estimating the number of clusters.

algorithm, centroid, discern, (15 more...)

arXiv.org Machine Learning

1910.05933

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > China (0.04)
North America > United States > Texas > Harris County > Houston (0.04)
(9 more...)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Unsupervised Discovery of Sparse Multimodal Representations in High Dimensional Data

Melton, Samuel, Ramanathan, Sharad

arXiv.org Machine LearningOct-13-2019

Extracting an understanding of the underlying system from high dimensional data is a growing problem in science. One primary challenge involves clustering the data points into classes with interpretable differences. A challenge in clustering high dimensional data is that multimodal signatures that define clusters may only be present in a small but unknown subspace. Discovering the subspace that defines clusters provides low dimensional representations of the data which capture cluster diversity, and provides greater understanding of the system by identifying the key underlying variables. Here, we define a class of problems in which linear separability of clusters is hidden in a low dimensional space. We propose an unsupervised model-free method to identify the subset of features that define a low dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low dimensional subspace. This method can be used as a wrapper for existing clustering algorithms to find low dimensional informative subspaces with multimodal signatures within high dimensional data.

dimension, subspace, transcription factor, (15 more...)

arXiv.org Machine Learning

1910.05814

Country:

North America > United States > Ohio > Montgomery County > Dayton (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Tyne and Wear > Sunderland (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

K-Means Clustering for Unsupervised Machine Learning

#artificialintelligenceOct-12-2019, 08:59:47 GMT

Artificial Intelligence (AI) and Machine Learning (ML) have revolutionized every aspect of our life and disrupted how we do business, unlike any other technology in the the history of mankind. Such disruption brings many challenges for professionals and businesses. In this article, I will provide an introduction to one of the most commonly used machine learning methods, K-Means. Machine learning is a scientific method that utilizes statistical methods along with the computational power of machines to convert data to wisdom that humans or the machine itself can use for taking certain actions. "It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."

cluster center, k-means, learning, (13 more...)

#artificialintelligence

Genre: Overview (0.55)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Residual Encoder-Decoder Network for Deep Subspace Clustering

Yang, Shuai, Zhu, Wenqi, Zhu, Yuesheng

arXiv.org Machine LearningOct-12-2019

Subspace clustering aims to cluster unlabeled data that lies in a union of low-dimensional linear subspaces. Deep subspace clustering approaches based on auto-encoders have become very popular to solve subspace clustering problems. However, the training of current deep methods converges slowly, which is much less efficient than traditional approaches. We propose a Residual Encoder-Decoder network for deep Subspace Clustering (RED-SC), which symmetrically links convolutional and deconvolutional layers with skip-layer connections, with which the training converges much faster. We use a self-expressive layer to generate more accurate linear representation coefficients through different latent representations from multiple latent spaces. Experiments show the superiority of RED-SC in training efficiency and clustering accuracy. Moreover, we are the first one to apply residual encoder-decoder on unsupervised learning tasks.

red-sc, subspace, subspace clustering, (13 more...)

arXiv.org Machine Learning

1910.05569

Country: Asia > China > Guangdong Province > Shenzhen (0.05)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Add feedback

Fairness in Clustering with Multiple Sensitive Attributes

Abraham, Savitha Sam, P, Deepak, Sundaram, Sowmya S

arXiv.org Artificial IntelligenceOct-11-2019

A clustering may be considered as fair on pre-specified sensitive attributes if the proportions of sensitive attribute groups in each cluster reflect that in the dataset. In this paper, we consider the task of fair clustering for scenarios involving multiple multi-valued or numeric sensitive attributes. We propose a fair clustering method, FairKM (Fair K-Means), that is inspired by the popular K-Means clustering formulation. We outline a computational notion of fairness which is used along with a cluster coherence objective, to yield the FairKM clustering method. We empirically evaluate our approach, wherein we quantify both the quality and fairness of clusters, over real-world datasets. Our experimental evaluation illustrates that the clusters generated by FairKM fare significantly better on both clustering quality and fair representation of sensitive attribute groups compared to the clusters from a state-of-the-art baseline fair clustering method.

dataset, fairkm, representation, (15 more...)

arXiv.org Artificial Intelligence

1910.05113

Country:

Asia > India (0.04)
North America > United States > California > Alameda County > Oakland (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Rk-means: Fast Clustering for Relational Data

Curtin, Ryan, Moseley, Ben, Ngo, Hung Q., Nguyen, XuanLong, Olteanu, Dan, Schleich, Maximilian

arXiv.org Machine LearningOct-10-2019

Conventional machine learning algorithms cannot be applied until a data matrix is available to process. When the data matrix needs to be obtained from a relational database via a feature extraction query, the computation cost can be prohibitive, as the data matrix may be (much) larger than the total input relation size. This paper introduces Rk-means, or relational k -means algorithm, for clustering relational data tuples without having to access the full data matrix. As such, we avoid having to run the expensive feature extraction query and storing its output. Our algorithm leverages the underlying structures in relational data. It involves construction of a small {\it grid coreset} of the data matrix for subsequent cluster construction. This gives a constant approximation for the k -means objective, while having asymptotic runtime improvements over standard approaches of first running the database query and then clustering. Empirical results show orders-of-magnitude speedup, and Rk-means can run faster on the database than even just computing the data matrix.

algorithm, artificial intelligence, survey article, (18 more...)

arXiv.org Machine Learning

1910.04939

Country:

North America > United States (0.93)
Oceania > Australia (0.14)
Europe > France (0.14)

Genre:

Research Report (0.70)
Workflow (0.48)

Technology:

Information Technology > Databases (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Combining Geometric and Topological Information in Image Segmentation

Luo, Hengrui, Strait, Justin, Saha, Abhijoy

arXiv.org Machine LearningOct-10-2019

A fundamental problem in computer vision is image segmentation, where the goal is to delineate the boundary of the object in the image. The focus of this work is on the segmentation of grayscale images and its purpose is two-fold. First, we conduct an in-depth study comparing active contour and topologically-based methods, two popular approaches for boundary detection of 2-dimensional images. Certain properties of the image dataset may favor one method over the other, both from an interpretability perspective as well as through evaluation of performance measures. Second, we propose the use of topological knowledge to assist an active contour method, which can potentially incorporate prior shape information. The latter is known to be extremely sensitive to algorithm initialization, and thus, we use a topological model to provide an automatic initialization. In addition, our proposed model can handle objects in images with more complex topological structures. We demonstrate this on artificially-constructed image datasets from computer vision, as well as real medical image data.

boundary, contour, segmentation, (16 more...)

arXiv.org Machine Learning

1910.04778

Country:

North America > United States > Ohio (0.04)
North America > United States > New York (0.04)
Asia > Japan (0.04)

Genre: Research Report (0.40)

Industry:

Health & Medicine > Therapeutic Area (0.93)
Health & Medicine > Diagnostic Medicine > Imaging (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback