AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Approximating Hierarchical MV-sets for Hierarchical Clustering

Glazer, Assaf, Weissbrod, Omer, Lindenbaum, Michael, Markovitch, Shaul

Neural Information Processing SystemsFeb-14-2020, 07:11:51 GMT

The goal of hierarchical clustering is to construct a cluster tree, which can be viewed as the modal structure of a density. For this purpose, we use a convex optimization program that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. We further extend existing graph-based methods to approximate the cluster tree of a distribution. By avoiding direct density estimation, our method is able to handle high-dimensional data more efficiently than existing density-based approaches. We present empirical results that demonstrate the superiority of our method over existing ones.

approximating hierarchical mv-set, cluster tree, hierarchical clustering

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability Using Clustering Techniques on Replicated Runs

Rieger, Jonas, Koppers, Lars, Jentsch, Carsten, Rahnenführer, Jörg

arXiv.org Artificial IntelligenceFeb-14-2020

For organizing large text corpora topic modeling provides useful tools. A widely used method is Latent Dirichlet Allocation (LDA), a generative probabilistic model which models single texts in a collection of texts as mixtures of latent topics. The assignments of words to topics rely on initial values such that generally the outcome of LDA is not fully reproducible. In addition, the reassignment via Gibbs Sampling is based on conditional distributions, leading to different results in replicated runs on the same text data. This fact is often neglected in everyday practice. We aim to improve the reliability of LDA results. Therefore, we study the stability of LDA by comparing assignments from replicated runs. We propose to quantify the similarity of two generated topics by a modified Jaccard coefficient. Using such similarities, topics can be clustered. A new pruning algorithm for hierarchical clustering results based on the idea that two LDA runs create pairs of similar topics is proposed. This approach leads to the new measure S-CLOP ({\bf S}imilarity of multiple sets by {\bf C}lustering with {\bf LO}cal {\bf P}runing) for quantifying the stability of LDA models. We discuss some characteristics of this measure and illustrate it with an application to real data consisting of newspaper articles from \textit{USA Today}. Our results show that the measure S-CLOP is useful for assessing the stability of LDA models or any other topic modeling procedure that characterize its topics by word distributions. Based on the newly proposed measure for LDA stability, we propose a method to increase the reliability and hence to improve the reproducibility of empirical findings based on topic modeling. This increase in reliability is obtained by running the LDA several times and taking as prototype the most representative run, that is the LDA run with highest average similarity to all other runs.

run1, run3, run4, (14 more...)

arXiv.org Artificial Intelligence

2003.0498

Country:

Europe > Austria > Vienna (0.14)
Europe > United Kingdom (0.05)
Europe > Germany > North Rhine-Westphalia > Arnsberg Region > Dortmund (0.04)
(4 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Leisure & Entertainment > Sports > Football (0.95)
Government > Regional Government > North America Government > United States Government (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.71)

Add feedback

Scalable Dyadic Independence Models with Local and Global Constraints

Adriaens, Florian, Mara, Alexandru, Lijffijt, Jefrey, De Bie, Tijl

arXiv.org Machine LearningFeb-14-2020

An important challenge in the field of exponential random graphs (ERGs) is the fitting of non-trivial ERGs on large networks. By utilizing matrix block-approximation techniques, we propose an approximative framework to such non-trivial ERGs that result in dyadic independence (i.e., edge independent) models, while being able to meaningfully model local information (degrees) as well as global information (clustering coefficient, assortativity, etc.) if desired. This allows one to efficiently generate random networks with similar properties as an observed network, scalable up to sparse graphs consisting of millions of nodes. Empirical evaluation demonstrates its competitiveness in terms of accuracy with state-of-the-art methods for link prediction and network reconstruction.

constraint, graph, scalable dyadic independence model, (14 more...)

arXiv.org Machine Learning

2002.07076

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Belgium (0.04)

Genre: Research Report (0.84)

Industry: Information Technology (0.69)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Clustering based on Point-Set Kernel

Ting, Kai Ming, Wells, Jonathan R., Zhu, Ye

arXiv.org Machine LearningFeb-13-2020

Measuring similarity between two objects is the core operation in existing cluster analyses in grouping similar objects into clusters. Cluster analyses have been applied to a number of applications, including image segmentation, social network analysis, and computational biology. This paper introduces a new similarity measure called point-set kernel which computes the similarity between an object and a sample of objects generated from an unknown distribution. The proposed clustering procedure utilizes this new measure to characterize both the typical point of every cluster and the cluster grown from the typical point. We show that the new clustering procedure is both effective and efficient such that it can deal with large scale datasets. In contrast, existing clustering algorithms are either efficient or effective; and even efficient ones have difficulty dealing with large scale datasets without special hardware. We show that the proposed algorithm is more effective and runs orders of magnitude faster than the state-of-the-art density-peak clustering and scalable kernel k-means clustering when applying to datasets of millions of data points, on commonly used computing machines.

artificial intelligence, kernel k-means, machine learning, (17 more...)

arXiv.org Machine Learning

2002.05815

Country:

Oceania > Australia (0.04)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)
North America > United States > Massachusetts > Middlesex County > Newton (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.90)

Add feedback

Tree-SNE: Hierarchical Clustering and Visualization Using t-SNE

Robinson, Isaac, Pierce-Hoffman, Emma

arXiv.org Machine LearningFeb-13-2020

Building on recent advances in speeding up t-SNE and obtaining finer-grained structure, we combine the two to create tree-SNE, a hierarchical clustering and visualization algorithm based on stacked one-dimensional t-SNE embeddings. We also introduce alpha-clustering, which recommends the optimal cluster assignment, without foreknowledge of the number of clusters, based off of the cluster stability across multiple scales. We demonstrate the effectiveness of tree-SNE and alphaclustering on images of handwritten digits, mass cytometry (CyTOF) data from blood cells, and single-cell RNAsequencing (scRNA-seq) data from retinal cells. Furthermore, to demonstrate the validity of the visualization, we use alpha-clustering to obtain unsupervised clustering results competitive with the state of the art on several image data sets. Software is available at https: //github.com/isaacrob/treesne.

alpha-clustering label, perplexity, visualization, (16 more...)

arXiv.org Machine Learning

2002.05687

Country:

North America > United States > Connecticut > New Haven County > New Haven (0.04)
North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Federated Clustering via Matrix Factorization Models: From Model Averaging to Gradient Sharing

Wang, Shuai, Chang, Tsung-Hui

arXiv.org Machine LearningFeb-12-2020

Recently, federated learning (FL) has drawn significant attention due to its capability of training a model over the network without knowing the client's private raw data. In this paper, we study the unsupervised clustering problem under the FL setting. By adopting a generalized matrix factorization model for clustering, we propose two novel (first-order) federated clustering (FedC) algorithms based on principles of model averaging and gradient sharing, respectively, and present their theoretical convergence conditions. We show that both algorithms have a O(1/T) convergence rate, where T is the total number of gradient evaluations per client, and the communication cost can be effectively reduced by controlling the local epoch length and allowing partial client participation within each communication round. Numerical experiments show that the FedC algorithm based on gradient sharing outperforms that based on model averaging, especially in scenarios with non-i.i.d. I. INTRODUCTION As one of the most fundamental data mining tasks, unsupervised clustering has a vast range of applications [1]. In view of the increasing volume of real-life data, distributed clustering methods that can process large-scale datasets in parallel computing environments have gained significant interests in the last decade [2], [3], [4]. However, recent emphasis on user privacy has called for new distributed schemes that can perform clustering without directly accessing the users' raw data.

algorithm, dataset, fedcgd, (14 more...)

arXiv.org Machine Learning

2002.0493

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(18 more...)

Genre: Research Report (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Structural Deep Clustering Network

Bo, Deyu, Wang, Xiao, Shi, Chuan, Zhu, Meiqi, Lu, Emiao, Cui, Peng

arXiv.org Machine LearningFeb-12-2020

Clustering is a fundamental task in data analysis. Recently, deep clustering, which derives inspiration primarily from deep learning approaches, achieves state-of-the-art performance and has attracted considerable attention. Current deep clustering methods usually boost the clustering results by means of the powerful representation ability of deep learning, e.g., autoencoder, suggesting that learning an effective representation for clustering is a crucial requirement. The strength of deep clustering methods is to extract the useful representations from the data itself, rather than the structure of data, which receives scarce attention in representation learning. Motivated by the great success of Graph Convolutional Network (GCN) in encoding the graph structure, we propose a Structural Deep Clustering Network (SDCN) to integrate the structural information into deep clustering. Specifically, we design a delivery operator to transfer the representations learned by autoencoder to the corresponding GCN layer, and a dual self-supervised mechanism to unify these two different deep neural architectures and guide the update of the whole model. In this way, the multiple structures of data, from low-order to high-order, are naturally combined with the multiple representations learned by autoencoder. Furthermore, we theoretically analyze the delivery operator, i.e., with the delivery operator, GCN improves the autoencoder-specific representation as a high-order graph regularization constraint and autoencoder helps alleviate the over-smoothing problem in GCN. Through comprehensive experiments, we demonstrate that our propose model can consistently perform better over the state-of-the-art techniques.

autoencoder, information, representation, (15 more...)

arXiv.org Machine Learning

doi: 10.1145/3366423.3380214

2002.01633

Country:

Asia > Taiwan > Taiwan Province > Taipei (0.05)
Asia > China > Beijing > Beijing (0.05)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Clustering Algorithms

#artificialintelligenceFeb-10-2020, 23:10:39 GMT

Let's quickly look at types of clustering algorithms and when you should choose each type. When choosing a clustering algorithm, you should consider whether the algorithm scales to your dataset. Datasets in machine learning can have millions of examples, but not all clustering algorithms scale efficiently. Many clustering algorithms work by computing the similarity between all pairs of examples. This means their runtime increases as the square of the number of examples \(n\), denoted as \(O(n 2)\) in complexity notation.

algorithm, clustering algorithm, gaussian distribution, (3 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

An Intuition to K-Means Clustering

#artificialintelligenceFeb-10-2020, 17:49:47 GMT

K-means clustering is one of the simplest and popular unsupervised machine learning algorithms. Unsupervised algorithms make inferences from data sets using only input vectors without referring to known, or labeled, outcomes. Unsupervised learning allows us to approach problems with little or no idea what our results should look like. Unsupervised algorithms find patterns based only on input data. The technique is useful when we're not quite sure what to look for.

algorithm, intuition, k-means clustering, (5 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.80)

Add feedback

Sparse and Smooth: improved guarantees for Spectral Clustering in the Dynamic Stochastic Block Model

Keriven, Nicolas, Vaiter, Samuel

arXiv.org Machine LearningFeb-10-2020

In this paper, we analyse classical variants of the Spectral Clustering (SC) algorithm in the Dynamic Stochastic Block Model (DSBM). Existing results show that, in the relatively sparse case where the expected degree grows logarithmically with the number of nodes, guarantees in the static case can be extended to the dynamic case and yield improved error bounds when the DSBM is sufficiently smooth in time, that is, the communities do not change too much between two time steps. We improve over these results by drawing a new link between the sparsity and the smoothness of the DSBM: the more regular the DSBM is, the more sparse it can be, while still guaranteeing consistent recovery. In particular, a mild condition on the smoothness allows to treat the sparse case with bounded degree. We also extend these guarantees to the normalized Laplacian, and as a by-product of our analysis, we obtain to our knowledge the best spectral concentration bound available for the normalized Laplacian of matrices with independent Bernoulli entries.

adjacency matrix, normalized laplacian, probability, (13 more...)

arXiv.org Machine Learning

2002.02892

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)
Europe > France > Bourgogne-Franche-Comté > Côte-d'Or > Dijon (0.04)
Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback