AITopics | Clustering


Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)


Estimating the Number of Components in Finite Mixture Models via the Group-Sort-Fuse Procedure

Manole, Tudor, Khalili, Abbas

arXiv.org Machine Learning, May-23-2020

Estimation of the number of components (or order) of a finite mixture model is a long-standing and challenging problem in statistics. We propose the Group-Sort-Fuse (GSF) procedure---a new penalized likelihood approach for simultaneous estimation of the order and mixing measure in multidimensional finite mixture models. Unlike methods which fit and compare mixtures with varying orders using criteria involving model complexity, our approach directly penalizes a continuous function of the model parameters. More specifically, given a conservative upper bound on the order, the GSF groups and sorts mixture component parameters to fuse those which are redundant. For a wide range of finite mixture models, we show that the GSF is consistent in estimating the true mixture order and achieves the $n^{-1/2}$ convergence rate for parameter estimation up to polylogarithmic factors. The GSF is implemented for several univariate and multivariate mixture models in the R package GroupSortFuse. Its finite sample performance is supported by a thorough simulation study, and its application is illustrated on two real data examples.

artificial intelligence, machine learning, mixture model, (21 more...)

arXiv.org Machine Learning

2005.11641

Country:

North America > United States > New York (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
(5 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Modeling & Simulation (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Clustering Frankenstein

#artificialintelligence, May-22-2020, 07:40:28 GMT

From time to time I come back to experiment with this stunning photograph of Boris Karloff as Frankenstein's monster. I have done several experiments with it previously, from decomposing it into Voronoi regions to drawing it as a single-line portrait using an algorithm that solves the travelling salesman problem. I also used this last technique to do a pencil portrait of the image. Today I will use a machine learning algorithm to reinterpret the monster once again. The idea is simple: once the photograph is loaded, the first step is to binarize it into a black-and-white image using the threshold function of the imager package.
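
The binarization step can be illustrated with a minimal sketch. The post uses the R package imager; the Python below is only a stand-in for the idea, not that package's API. The function names `binarize` and `dark_pixel_coords`, the 0.5 cutoff, and the list-of-rows image layout are all assumptions made for illustration.

```python
def binarize(gray, threshold=0.5):
    """Return a black-and-white copy of a grayscale image.

    `gray` is a list of rows holding intensities in [0, 1] (an assumed layout).
    Pixels strictly above `threshold` become 1 (white); all others become 0 (black).
    """
    return [[1 if px > threshold else 0 for px in row] for row in gray]


def dark_pixel_coords(binary):
    """Hypothetical helper: list (row, col) positions of the black pixels,
    which is the kind of point set a clustering step could consume."""
    return [(r, c)
            for r, row in enumerate(binary)
            for c, value in enumerate(row)
            if value == 0]


if __name__ == "__main__":
    image = [[0.1, 0.9],
             [0.5, 0.6]]
    bw = binarize(image)
    print(bw)
    print(dark_pixel_coords(bw))
```

A fixed cutoff is the simplest choice; a real photograph may need the cutoff tuned to its overall brightness.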

artificial intelligence, clustering frankenstein, machine learning, (6 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.37)

Intent Mining from past conversations for Conversational Agent

Chatterjee, Ajay, Sengupta, Shubhashis

arXiv.org Artificial Intelligence, May-22-2020

Conversational systems are of primary interest in the AI community. Chatbots are increasingly being deployed to provide round-the-clock support and to increase customer engagement. Many of the commercial bot building frameworks follow a standard approach that requires one to build and train an intent model to recognize a user input. Intent models are trained in a supervised setting with a collection of pairs of textual utterances and intent labels. Gathering substantial training data with wide coverage of different intents is a bottleneck in the bot building process. Moreover, labeling hundreds to thousands of conversations with intents is time-consuming and laborious. In this paper, we present an intent discovery framework that generates intent training data from raw conversations in 4 primary steps: extraction of textual utterances from a conversation using a pre-trained domain-agnostic Dialog Act Classifier (Data Extraction), automatic clustering of similar user utterances (Clustering), manual annotation of clusters with an intent label (Labeling), and propagation of intent labels to the utterances from the previous step that are not mapped to any cluster (Label Propagation). We introduce a novel density-based clustering algorithm, ITER-DBSCAN, for unbalanced data clustering. Subject matter experts (annotators with domain expertise) manually review the clustered user utterances and provide an intent label for each cluster. We conducted user studies to validate the effectiveness of the resulting trained intent model in terms of coverage of intents, accuracy, and time saved relative to manual annotation. Although the system is developed for building an intent model for a conversational system, the framework can also be used for short text clustering or as a labeling framework.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2005.11014

Country:

North America > United States > New York > New York County > New York City (0.14)
Asia > India (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(5 more...)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Simple, Scalable, and Stable Variational Deep Clustering

Cao, Lele, Asadi, Sahar, Zhu, Wenfei, Schmidli, Christian, Sjöberg, Michael

arXiv.org Machine Learning, May-21-2020

Deep clustering (DC) has become the state-of-the-art for unsupervised clustering. In principle, DC represents a variety of unsupervised methods that jointly learn the underlying clusters and the latent representation directly from unstructured datasets. However, DC methods are generally poorly applied due to high operational costs, low scalability, and unstable results. In this paper, we first evaluate several popular DC variants in the context of industrial applicability using eight empirical criteria. We then choose to focus on variational deep clustering (VDC) methods, since they mostly meet those criteria except for simplicity, scalability, and stability. To address these three unmet criteria, we introduce four generic algorithmic improvements: initial $\gamma$-training, periodic $\beta$-annealing, mini-batch GMM (Gaussian mixture model) initialization, and inverse min-max transform. We also propose a novel clustering algorithm S3VDC (simple, scalable, and stable VDC) that incorporates all those improvements. Our experiments show that S3VDC outperforms the state-of-the-art on both benchmark tasks and a large unstructured industrial dataset without any ground truth label. In addition, we analytically evaluate the usability and interpretability of S3VDC.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Machine Learning

2005.08047

Country:

Europe > Sweden > Uppsala County > Uppsala (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Local semi-supervised approach to brain tissue classification in child brain MRI

Portman, Nataliya, Toussaint, Paule-J, Evans, Alan C.

arXiv.org Machine Learning, May-20-2020

Most segmentation methods in child brain MRI are supervised and are based on global intensity distributions of major brain structures. The successful implementation of a supervised approach depends on the availability of an age-appropriate probabilistic brain atlas. For the study of early normal brain development, the construction of such a brain atlas remains a significant challenge. Moreover, using global intensity statistics leads to inaccurate detection of major brain tissue classes due to substantial intensity variations of the MR signal within the constituent parts of the early developing brain. In order to overcome these methodological limitations, we develop a local, semi-supervised framework. It is based on Kernel Fisher Discriminant Analysis (KFDA) for pattern recognition, combined with an objective structural similarity index (SSIM) for perceptual image quality assessment. The proposed method performs optimal brain partitioning into subdomains having different average intensity values, followed by SSIM-guided computation of separating surfaces between the constituent brain parts. The classified image subdomains are then stitched slice by slice via simulated annealing to form a global image of the classified brain. In this paper, we consider classification into major tissue classes (white matter and grey matter) and the cerebrospinal fluid, and illustrate the proposed framework on examples of brain templates for ages 8 to 11 months and ages 44 to 60 months. We show that our method improves detection of the tissue classes compared with state-of-the-art classification techniques known as Partial Volume Estimation.

artificial intelligence, classification, machine learning, (16 more...)

arXiv.org Machine Learning

2005.09871

Country:

North America > Canada > Quebec > Montreal (0.14)
North America > United States > New York (0.04)
North America > United States > Maryland > Baltimore (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

k-sums: another side of k-means

Zhao, Wan-Lei, Chen, Run-Qing, Ye, Hui, Ngo, Chong-Wah

arXiv.org Machine Learning, May-19-2020

In this paper, the decades-old clustering method k-means is revisited. The original distortion minimization model of k-means is addressed by a pure stochastic minimization procedure. In each step of the iteration, one sample is tentatively reallocated from one cluster to another. It is moved to another cluster as long as the reallocation allows the sample to be closer to the new centroid. This optimization procedure converges faster, and to a better local minimum, than k-means and many of its variants. This fundamental modification of the k-means loop leads to the redefinition of a family of k-means variants. Moreover, a new target function that minimizes the summation of pairwise distances within clusters is presented. We show that it can be solved under the same stochastic optimization procedure. This minimization procedure, built upon two minimization models, outperforms k-means and its variants considerably under different settings and on different datasets.
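
The reallocation rule described in this abstract can be sketched roughly as follows. This is only one reading of the abstract, not the authors' algorithm: the paper's exact move criterion and centroid updates may differ. The function name `stochastic_reallocation`, the fixed step count, the rule of never emptying a cluster, and the full centroid recomputation at every step are all assumptions chosen for brevity.

```python
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def stochastic_reallocation(points, labels, k, steps=1000, seed=0):
    """Pick one sample at a time and move it to another cluster whenever
    that cluster's centroid is closer than its current centroid."""
    rng = random.Random(seed)
    labels = list(labels)
    for _ in range(steps):
        i = rng.randrange(len(points))
        src = labels[i]
        members = [[p for p, lab in zip(points, labels) if lab == c]
                   for c in range(k)]
        if len(members[src]) <= 1:
            continue  # assumption: never empty a cluster
        # Recomputed from scratch each step; an incremental update would be faster.
        cents = [centroid(m) if m else None for m in members]
        candidates = [c for c in range(k) if c != src and cents[c] is not None]
        if not candidates:
            continue
        best = min(candidates, key=lambda c: sq_dist(points[i], cents[c]))
        if sq_dist(points[i], cents[best]) < sq_dist(points[i], cents[src]):
            labels[i] = best  # accept the tentative reallocation
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(stochastic_reallocation(pts, [0, 1, 0, 1], k=2))
```

Unlike the batch update in standard k-means, each step here touches a single sample, which is the property the abstract highlights.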

artificial intelligence, centroid, machine learning, (19 more...)

arXiv.org Machine Learning

2005.09485

Country:

Asia > China > Hong Kong (0.04)
Asia > China > Fujian Province > Xiamen (0.04)
North America > United States > New York (0.04)
North America > United States > New Jersey > Hudson County > Secaucus (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

PageRank and The K-Means Clustering Algorithm

Hajij, Mustafa, Said, Eyad, Todd, Robert

arXiv.org Machine Learning, May-19-2020

We introduce a graph clustering algorithm that generalizes $k$-means to graphs. Our method utilizes PageRank measures on graphs to quickly and robustly compute centrality of nodes in a given graph. Furthermore, we show how our method can be generalized to metric spaces and apply it to other domains such as point clouds and triangulated meshes.

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv.org Machine Learning

2005.04774

Country: North America > United States > New Mexico > Los Alamos County > Los Alamos (0.05)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Stable and consistent density-based clustering

Rolle, Alexander, Scoccola, Luis

arXiv.org Machine Learning, May-18-2020

We present a consistent approach to density-based clustering, which satisfies a stability theorem that holds without any distributional assumptions. We also show that the algorithm can be combined with standard procedures to extract a flat clustering from a hierarchical clustering, and that the resulting flat clustering algorithms satisfy stability theorems. The algorithms and proofs are inspired by topological data analysis.

artificial intelligence, machine learning, procedure, (16 more...)

arXiv.org Machine Learning

2005.09048

Country:

North America > United States > New York (0.04)
North America > Canada > Ontario (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)
Europe > Austria > Styria > Graz (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

A New Validity Index for Fuzzy-Possibilistic C-Means Clustering

Zarandi, Mohammad Hossein Fazel, Sotudian, Shahabeddin, Castillo, Oscar

arXiv.org Artificial Intelligence, May-18-2020

In some complicated datasets, due to the presence of noisy data points and outliers, cluster validity indices can give conflicting results in determining the optimal number of clusters. This paper presents a new validity index for fuzzy-possibilistic c-means (FPCM) clustering, called the Fuzzy-Possibilistic (FP) index, which works well in the presence of clusters that vary in shape and density. Moreover, FPCM, like most clustering algorithms, is sensitive to some initial parameters. In this regard, in addition to the number of clusters, FPCM requires a priori selection of the degree of fuzziness (m) and the degree of typicality (η). Therefore, we present an efficient procedure for determining optimal values for m and η. The proposed approach has been evaluated using several synthetic and real-world datasets. Final computational results demonstrate the capabilities and reliability of the proposed approach compared with several well-known fuzzy validity indices in the literature. Furthermore, to clarify the ability of the proposed method in real applications, it is applied to microarray gene expression data clustering and medical image segmentation.

artificial intelligence, machine learning, validity index, (18 more...)

arXiv.org Artificial Intelligence

2005.09162

Country:

North America > United States > Wisconsin (0.04)
North America > United States > New York (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report > New Finding (0.88)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.90)
Health & Medicine > Diagnostic Medicine > Imaging (0.49)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Towards Automatic Clustering Analysis Using Traces of Information Gain: The InfoGuide Method

Rocha, Paulo (University of Pernambuco) | Pinheiro, Diego (University of California Davis) | Cadeiras, Martin (University of California Davis) | Bastos-Filho, Carmelo (University of Pernambuco)

AAAI Conferences, May-16-2020

Clustering analysis has become a ubiquitous information retrieval tool in a wide range of domains, but a more automatic framework is still lacking. Though internal metrics are the key players towards a successful retrieval of clusters, their effectiveness on real-world datasets is still not fully understood, mainly because of their unrealistic assumptions about the underlying datasets. We hypothesized that capturing traces of information gain between increasingly complex clustering retrievals---InfoGuide---enables an automatic clustering analysis with improved clustering retrievals. We validated the InfoGuide hypothesis by capturing the traces of information gain using the Kolmogorov-Smirnov statistic and comparing the clusters retrieved by InfoGuide against those retrieved by other commonly used internal metrics on artificially generated, benchmark, and real-world datasets. Our results suggest that InfoGuide can enable a more automatic clustering analysis and may be more suitable for retrieving clusters in real-world datasets displaying nontrivial statistical properties.
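
For readers unfamiliar with the Kolmogorov-Smirnov statistic mentioned here, a minimal sketch of the standard two-sample version follows. The abstract does not say how InfoGuide applies the statistic, so this shows only the generic definition (the largest gap between two empirical distribution functions); the function name `ks_statistic` is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic:
    max over x of |F_a(x) - F_b(x)|, where F_a and F_b are the
    empirical cumulative distribution functions of the two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The empirical CDFs are step functions, so the largest gap is
    # attained at one of the observed values.
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


if __name__ == "__main__":
    print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A value near 0 means the two samples have nearly identical empirical distributions, while a value of 1 means they do not overlap at all.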

artificial intelligence, automatic clustering analysis, machine learning, (2 more...)

AAAI Conferences

The Thirty-Third International Flairs Conference

Genre: Research Report > New Finding (0.53)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.80)
