AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Clustering in Python

#artificialintelligenceDec-4-2020, 20:19:32 GMT

This article was published as a part of the Data Science Blogathon. Cluster analysis or clustering is an unsupervised machine learning algorithm that groups unlabeled datasets. It aims to form clusters or groups using the data points in a dataset in such a way that there is high intra-cluster similarity and low inter-cluster similarity. In, layman terms clustering aims at forming subsets or groups within a dataset consisting of data points which are really similar to each other and the groups or subsets or clusters formed can be significantly differentiated from each other. Let's assume we have a dataset and we don't know anything about it.

algorithm, centroid, new centroid, (16 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Selective Inference for Hierarchical Clustering

Gao, Lucy L., Bien, Jacob, Witten, Daniela

arXiv.org Machine LearningDec-4-2020

Testing for a difference in means between two groups is fundamental to answering research questions across virtually every scientific area. Classical tests control the Type I error rate when the groups are defined a priori. However, when the groups are instead defined via a clustering algorithm, then applying a classical test for a difference in means between the groups yields an extremely inflated Type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a difference in means between two clusters obtained from any clustering method. Our procedure controls the selective Type I error rate by accounting for the fact that the null hypothesis was generated from the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly used linkages. We apply our method to simulated data and to single-cell RNA-seq data.

euclidean distance, linkage, true cluster, (15 more...)

arXiv.org Machine Learning

2012.02936

Country:

North America > United States > California (0.14)
Antarctica (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Health & Medicine (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Adaptive Explicit Kernel Minkowski Weighted K-means

Aradnia, Amir, Haeri, Maryam Amir, Ebadzadeh, Mohammad Mehdi

arXiv.org Machine LearningDec-4-2020

The K-means algorithm is among the most commonly used data clustering methods. However, the regular K-means can only be applied in the input space and it is applicable when clusters are linearly separable. The kernel K-means, which extends K-means into the kernel space, is able to capture nonlinear structures and identify arbitrarily shaped clusters. However, kernel methods often operate on the kernel matrix of the data, which scale poorly with the size of the matrix or suffer from the high clustering cost due to the repetitive calculations of kernel values. Another issue is that algorithms access the data only through evaluations of $K(x_i, x_j)$, which limits many processes that can be done on data through the clustering task. This paper proposes a method to combine the advantages of the linear and nonlinear approaches by using driven corresponding approximate finite-dimensional feature maps based on spectral analysis. Applying approximate finite-dimensional feature maps were only discussed in the Support Vector Machines (SVM) problems before. We suggest using this method in kernel K-means era as alleviates storing huge kernel matrix in memory, further calculating cluster centers more efficiently and access the data explicitly in feature space. These explicit feature maps enable us to access the data in the feature space explicitly and take advantage of K-means extensions in that space. We demonstrate our Explicit Kernel Minkowski Weighted K-mean (Explicit KMWK-mean) method is able to be more adopted and find best-fitting values in new space by applying additional Minkowski exponent and feature weights parameter. Moreover, it can reduce the impact of concentration on nearest neighbour search by suggesting investigate among other norms instead of Euclidean norm, includes Minkowski norms and fractional norms (as an extension of the Minkowski norms with p<1).

algorithm, k-means, kernel, (15 more...)

arXiv.org Machine Learning

2012.02805

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
North America > United States > California > Alameda County > Oakland (0.04)
Europe > Netherlands (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Hierarchical Clustering and Zeroth Persistent Homology

Güzel, İsmail, Kaygun, Atabey

arXiv.org Machine LearningDec-4-2020

In this article, we show that hierarchical clustering and the zeroth persistent homology do deliver the same topological information about a given data set. We show this fact using cophenetic matrices constructed out of the filtered Vietoris-Rips complex of the data set at hand. As in any cophenetic matrix, one can also display the inter-relations of zeroth homology classes via a rooted tree, also known as a dendogram. Since homological cophenetic matrices can be calculated for higher homologies, one can also sketch similar dendograms for higher persistent homology classes.

homology, information, persistent homology, (16 more...)

arXiv.org Machine Learning

2012.02655

Country:

Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.92)

Add feedback

A Novel index-based multidimensional data organization model that enhances the predictability of the machine learning algorithms

Rahman, Mahbubur

arXiv.org Artificial IntelligenceDec-3-2020

Learning from the multidimensional data has been an interesting concept in the field of machine learning. However, such learning can be difficult, complex, expensive because of expensive data processing, manipulations as the number of dimension increases. As a result, we have introduced an ordered index-based data organization model as the ordered data set provides easy and efficient access than the unordered one and finally, such organization can improve the learning. The ordering maps the multidimensional dataset in the reduced space and ensures that the information associated with the learning can be retrieved back and forth efficiently. We have found that such multidimensional data storage can enhance the predictability for both the unsupervised and supervised machine learning algorithms.

algorithm, data organization model, euclidean norm, (9 more...)

arXiv.org Artificial Intelligence

doi: 10.5121/csit.2020.101210

2012.02007

Country:

North America > United States > Texas > Fort Bend County > Stafford (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > Alaska > Anchorage Municipality > Anchorage (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Industry: Information Technology (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.73)

Add feedback

Robustly Learning Mixtures of $k$ Arbitrary Gaussians

Bakshi, Ainesh, Diakonikolas, Ilias, Jia, He, Kane, Daniel M., Kothari, Pravesh K., Vempala, Santosh S.

arXiv.org Machine LearningDec-3-2020

We give a polynomial-time algorithm for the problem of robustly estimating a mixture of $k$ arbitrary Gaussians in $\mathbb{R}^d$, for any fixed $k$, in the presence of a constant fraction of arbitrary corruptions. This resolves the main open problem in several previous works on algorithmic robust statistics, which addressed the special cases of robustly estimating (a) a single Gaussian, (b) a mixture of TV-distance separated Gaussians, and (c) a uniform mixture of two Gaussians. Our main tools are an efficient \emph{partial clustering} algorithm that relies on the sum-of-squares method, and a novel tensor decomposition algorithm that allows errors in both Frobenius norm and low-rank terms.

algorithm, gaussian, nullnull, (16 more...)

arXiv.org Machine Learning

2012.02119

Country:

North America > United States > New York (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > Netherlands > South Holland > Dordrecht (0.04)
(4 more...)

Genre: Research Report (0.49)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Consistent Representation Learning for High Dimensional Data Analysis

Li, Stan Z., Wu, Lirong, Zang, Zelin

arXiv.org Machine LearningDec-1-2020

High dimensional data analysis for exploration and discovery includes three fundamental tasks: dimensionality reduction, clustering, and visualization. When the three associated tasks are done separately, as is often the case thus far, inconsistencies can occur among the tasks in terms of data geometry and others. This can lead to confusing or misleading data interpretation. In this paper, we propose a novel neural network-based method, called Consistent Representation Learning (CRL), to accomplish the three associated tasks end-to-end and improve the consistencies. The CRL network consists of two nonlinear dimensionality reduction (NLDR) transformations: (1) one from the input data space to the latent feature space for clustering, and (2) the other from the clustering space to the final 2D or 3D space for visualization. Importantly, the two NLDR transformations are performed to best satisfy local geometry preserving (LGP) constraints across the spaces or network layers, to improve data consistencies along with the processing flow. Also, we propose a novel metric, clustering-visualization inconsistency (CVI), for evaluating the inconsistencies. Extensive comparative results show that the proposed CRL neural network method outperforms the popular t-SNE and UMAP-based and other contemporary clustering and visualization algorithms in terms of evaluation metrics and visualization.

constraint, dataset, visualization, (15 more...)

arXiv.org Machine Learning

2012.00481

Country:

Europe > Netherlands > South Holland > Leiden (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Mayurji/N2D-Pytorch

#artificialintelligenceNov-30-2020, 06:33:24 GMT

Deep clustering has increasingly been demonstrating superiority over conventional shallow clustering algorithms. Deep clustering algorithms usually combine representation learning with deep neural networks to achieve this performance, typically optimizing a clustering and non-clustering loss. In such cases, an autoencoder is typically connected with a clustering network, and the final clustering is jointly learned by both the autoencoder and clustering network. Instead, we propose to learn an autoencoded embedding and then search this further for the underlying manifold. We study a number of local and global manifold learning methods on both the raw data and autoencoded embedding, concluding that UMAP in our framework is able to find the best clusterable manifold of the embedding.

algorithm, manifold, mayurji n2d-pytorch

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.74)

Add feedback

A Framework for Authorial Clustering of Shorter Texts in Latent Semantic Spaces

Trad, Rafi, Spiliopoulou, Myra

arXiv.org Artificial IntelligenceNov-30-2020

Authorial clustering involves the grouping of documents written by the same author or team of authors without any prior positive examples of an author's writing style or thematic preferences. For authorial clustering on shorter texts (paragraph-length texts that are typically shorter than conventional documents), the document representation is particularly important: very high-dimensional feature spaces lead to data sparsity and suffer from serious consequences like the curse of dimensionality, while feature selection may lead to information loss. We propose a high-level framework which utilizes a compact data representation in a latent feature space derived with non-parametric topic modeling. Authorial clusters are identified thereafter in two scenarios: (a) fully unsupervised and (b) semi-supervised where a small number of shorter texts are known to belong to the same author (must-link constraints) or not (cannot-link constraints). We report on experiments with 120 collections in three languages and two genres and show that the topic-based latent feature space provides a promising level of performance while reducing the dimensionality by a factor of 1500 compared to state-of-the-arts. We also demonstrate that, while prior knowledge on the precise number of authors (i.e. authorial clusters) does not contribute much to additional quality, little knowledge on constraints in authorial clusters memberships leads to clear performance improvements in front of this difficult task. Thorough experimentation with standard metrics indicates that there still remains an ample room for improvement for authorial clustering, especially with shorter texts

algorithm, authorial, constraint, (16 more...)

arXiv.org Artificial Intelligence

2011.15038

Country:

Europe > Germany > Saxony-Anhalt > Magdeburg (0.05)
Asia > Middle East > Jordan (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre:

Research Report > New Finding (0.46)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)

Add feedback

Doubly Stochastic Subspace Clustering

Lim, Derek, Vidal, René, Haeffele, Benjamin D.

arXiv.org Artificial IntelligenceNov-30-2020

Many state-of-the-art subspace clustering methods follow a two-step process by first constructing an affinity matrix between data points and then applying spectral clustering to this affinity. Most of the research into these methods focuses on the first step of generating the affinity matrix, which often exploits the self-expressive property of linear subspaces, with little consideration typically given to the spectral clustering step that produces the final clustering. Moreover, existing methods obtain the affinity by applying ad-hoc postprocessing steps to the self-expressive representation of the data, and this postprocessing can have a significant impact on the subsequent spectral clustering step. In this work, we propose to unify these two steps by jointly learning both a self-expressive representation of the data and an affinity matrix that is well-normalized for spectral clustering. In the proposed model, we constrain the affinity matrix to be doubly stochastic, which results in a principled method for affinity matrix normalization while also exploiting the known benefits of doubly stochastic normalization in spectral clustering. While our proposed model is non-convex, we give a convex relaxation that is provably equivalent in many regimes; we also develop an efficient approximation to the full model that works well in practice. Experiments show that our method achieves state-of-the-art subspace clustering performance on many common datasets in computer vision.

affinity, spectral, subspace, (14 more...)

arXiv.org Artificial Intelligence

2011.14859

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (1.00)
Workflow (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback