AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

STICC: A multivariate spatial clustering method for repeated geographic pattern discovery with consideration of spatial contiguity

Kang, Yuhao, Wu, Kunlin, Gao, Song, Ng, Ignavier, Rao, Jinmeng, Ye, Shan, Zhang, Fan, Fei, Teng

arXiv.org Machine LearningMar-30-2022

Spatial clustering has been widely used for spatial data mining and knowledge discovery. An ideal multivariate spatial clustering should consider both spatial contiguity and aspatial attributes. Existing spatial clustering approaches may face challenges for discovering repeated geographic patterns with spatial contiguity maintained. In this paper, we propose a Spatial Toeplitz Inverse Covariance-Based Clustering (STICC) method that considers both attributes and spatial relationships of geographic objects for multivariate spatial clustering. A subregion is created for each geographic object serving as the basic unit when performing clustering. A Markov random field is then constructed to characterize the attribute dependencies of subregions. Using a spatial consistency strategy, nearby objects are encouraged to belong to the same cluster. To test the performance of the proposed STICC algorithm, we apply it in two use cases. The comparison results with several baseline methods show that the STICC outperforms others significantly in terms of adjusted rand index and macro-F1 score. Join count statistics is also calculated and shows that the spatial contiguity is well preserved by STICC. Such a spatial clustering method may benefit various applications in the fields of geography, remote sensing, transportation, and urban planning, etc.

artificial intelligence, machine learning, spatial reasoning, (15 more...)

arXiv.org Machine Learning

doi: 10.1080/13658816.2022.2053980

2203.09611

Country:

North America > United States > Wisconsin > Dane County > Madison (0.14)
Asia > China > Hubei Province > Wuhan (0.04)
North America > United States > New York (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine (1.00)
Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Graph similarity learning for change-point detection in dynamic networks

Sulem, Deborah, Kenlay, Henry, Cucuringu, Mihai, Dong, Xiaowen

arXiv.org Machine LearningMar-29-2022

Dynamic networks are ubiquitous for modelling sequential graph-structured data, e.g., brain connectome, population flows and messages exchanges. In this work, we consider dynamic networks that are temporal sequences of graph snapshots, and aim at detecting abrupt changes in their structure. This task is often termed network change-point detection and has numerous applications, such as fraud detection or physical motion monitoring. Leveraging a graph neural network model, we design a method to perform online network change-point detection that can adapt to the specific network domain and localise changes with no delay. The main novelty of our method is to use a siamese graph neural network architecture for learning a data-driven graph similarity function, which allows to effectively compare the current graph and its recent history. Importantly, our method does not require prior knowledge on the network generative distribution and is agnostic to the type of change-points; moreover, it can be applied to a large variety of networks, that include for instance edge weights and node attributes. We show on synthetic and real data that our method enjoys a number of benefits: it is able to learn an adequate graph similarity function for performing online network change-point detection in diverse types of change-point settings, and requires a shorter data history to detect changes than most existing state-of-the-art baselines.

artificial intelligence, dynamic network, machine learning, (18 more...)

arXiv.org Machine Learning

2203.1547

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > Spain > Galicia > Madrid (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (1.00)
Banking & Finance > Trading (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Selective inference for k-means clustering

Chen, Yiqun T., Witten, Daniela M.

arXiv.org Machine LearningMar-29-2022

If the groups under investigation are pre-specified, i.e., not a function of the observed data, then classical hypothesis tests will control the Type I error rate. However, it is increasingly common to want to test for a difference in means between groups that are defined through the observed data, e.g., via the output of a clustering algorithm. For instance, in single-cell RNA-sequencing analysis, researchers often first cluster the cells, and then test for a difference in the expected gene expression levels between the clusters to quantify up-or down-regulation of genes, annotate known cell types, and identify new cell types (Grün et al., 2015; Aizarani et al., 2019; Lähnemann et al., 2020; Zhang et al., 2019; Doughty & Kerkhoven, 2020). In fact, the inferential challenges resulting from testing data-guided hypotheses have been described as a "grand challenge" in the field of genomics (Lähnemann et al., 2020), and papers in the field continue to overlook this issue: as an example, seurat (Stuart et al., 2019), the state-of-the-art single-cell RNA sequencing analysis tool, tests for differential gene expression between groups obtained via clustering, with a note that "p-values [from these hypotheses] should be interpreted cautiously, as the genes used for clustering are the same genes tested for differential expression." Testing data-guided hypothesis also arises in the field of neuroscience (Kriegeskorte et al., 2009; Button, 2019), social psychology (Hung & Fithian, 2020), and physical sciences (Friederich et al., 2020; Pollice

artificial intelligence, inference, machine learning, (17 more...)

arXiv.org Machine Learning

2203.15267

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.68)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.88)
Health & Medicine > Therapeutic Area > Neurology (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Elbow detection for clustering using splines

#artificialintelligenceMar-28-2022, 14:19:53 GMT

Among the methods offered by machine learning and artificial intelligence, clustering methods are among the most interesting. These methods belong to the class of unsupervised methods, and as such do not suffer from bias or presuppositions, since they do not seek to learn a known rule, but rather to identify unknown links. Their appeal, therefore, lies in their ability to make sense of data whose volume and/or cardinality exceed the processing capabilities of a human. There are in the field of artificial intelligence, two major classes of methods: the supervised approach and the unsupervised approach. They are distinguished by the form of the problem that is submitted to machine learning.

curvature, spline, variance, (13 more...)

#artificialintelligence

Industry: Health & Medicine > Therapeutic Area (0.97)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.37)

Add feedback

Convex Non-negative Matrix Factorization Through Quantum Annealing

Zaiou, Ahmed, Matei, Basarab, Bennani, Younès, Hibti, Mohamed

arXiv.org Machine LearningMar-28-2022

In this paper we provide the quantum version of the Convex Non-negative Matrix Factorization algorithm (Convex-NMF) by using the D-wave quantum annealer. More precisely, we use D-wave 2000Q to find the low rank approximation of a fixed real-valued matrix X by the product of two non-negative matrices factors W and G such that the Frobenius norm of the difference X-XWG is minimized. In order to solve this optimization problem we proceed in two steps. In the first step we transform the global real optimization problem depending on W,G into two quadratic unconstrained binary optimization problems (QUBO) depending on W and G respectively. In the second step we use an alternative strategy between the two QUBO problems corresponding to W and G to find the global solution. The running of these two QUBO problems on D-wave 2000Q need to use an embedding to the chimera graph of D-wave 2000Q, this embedding is limited by the number of qubits of D-wave 2000Q. We perform a study on the maximum number of real data to be used by our approach on D-wave 2000Q. The proposed study is based on the number of qubits used to represent each real variable. We also tested our approach on D-Wave 2000Q with several randomly generated data sets to prove that our approach is faster than the classical approach and also to prove that it gets the best results.

artificial intelligence, machine learning, optimization problem, (15 more...)

arXiv.org Machine Learning

2203.15634

Country: Europe > France (0.05)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

What is K-Means?

#artificialintelligenceMar-27-2022, 01:05:26 GMT

K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems in machine learning or data science. In this article, I would be giving you a detailed explanation and how this model works. Unsupervised Learning method K-Means Clustering divides the unlabeled dataset into various clusters. K specifies the number of pre-defined clusters that must be produced throughout the process; for example, if K 2, two clusters will be created, and if K 3, three clusters will be created, and so on. It allows us to cluster data into distinct groups and provides a simple technique to determine the categories of groups in an unlabeled dataset without any training. It's a centroid-based approach, which means that each cluster has its own centroid.

algorithm, k-means, unlabeled dataset, (1 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

DeepDPM: Deep Clustering With an Unknown Number of Clusters

Ronen, Meitar, Finder, Shahaf E., Freifeld, Oren

arXiv.org Machine LearningMar-27-2022

Deep Learning (DL) has shown great promise in the unsupervised task of clustering. That said, while in classical (i.e., non-deep) clustering the benefits of the nonparametric approach are well known, most deep-clustering methods are parametric: namely, they require a predefined and fixed number of clusters, denoted by K. When K is unknown, however, using model-selection criteria to choose its optimal value might become computationally expensive, especially in DL as the training process would have to be repeated numerous times. In this work, we bridge this gap by introducing an effective deep-clustering method that does not require knowing the value of K as it infers it during the learning. Using a split/merge framework, a dynamic architecture that adapts to the changing K, and a novel loss, our proposed method outperforms existing nonparametric methods (both classical and deep ones). While the very few existing deep nonparametric methods lack scalability, we demonstrate ours by being the first to report the performance of such a method on ImageNet. We also demonstrate the importance of inferring K by showing how methods that fix it deteriorate in performance when their assumed K value gets further from the ground-truth one, especially on imbalanced datasets. Our code is available at https://github.com/BGU-CS-VIL/DeepDPM.

artificial intelligence, deepdpm, machine learning, (17 more...)

arXiv.org Machine Learning

2203.14309

Country:

North America > United States > Massachusetts (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.50)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Concept Embedding Analysis: A Review

Schwalbe, Gesina

arXiv.org Machine LearningMar-25-2022

Deep neural networks (DNNs) have found their way into many applications with potential impact on the safety, security, and fairness of human-machine-systems. Such require basic understanding and sufficient trust by the users. This motivated the research field of explainable artificial intelligence (XAI), i.e. finding methods for opening the "black-boxes" DNNs represent. For the computer vision domain in specific, practical assessment of DNNs requires a globally valid association of human interpretable concepts with internals of the model. The research field of concept (embedding) analysis (CA) tackles this problem: CA aims to find global, assessable associations of humanly interpretable semantic concepts (e.g., eye, bearded) with internal representations of a DNN. This work establishes a general definition of CA and a taxonomy for CA methods, uniting several ideas from literature. That allows to easily position and compare CA approaches. Guided by the defined notions, the current state-of-the-art research regarding CA methods and interesting applications are reviewed. More than thirty relevant methods are discussed, compared, and categorized. Finally, for practitioners, a survey of fifteen datasets is provided that have been used for supervised concept analysis. Open challenges and research directions are pointed out at the end.

data mining, machine learning, natural language, (22 more...)

arXiv.org Machine Learning

2203.13909

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California (0.04)
Europe > Germany > Bavaria > Regensburg (0.04)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Information Technology (0.92)
Government > Regional Government (0.67)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
(6 more...)

Add feedback

Understanding K-Means Clustering Algorithm - Analytics Vidhya

#artificialintelligenceMar-24-2022, 02:00:20 GMT

With the rising use of the Internet in today's society, the quantity of data created is incomprehensibly huge. Even though the nature of individual data is straightforward, the sheer amount of data to be analyzed makes processing difficult for even computers. To manage such procedures, we need large data analysis tools. Data mining methods and techniques, in conjunction with machine learning, enable us to analyze large amounts of data in an intelligible manner. It is capable of classifying unlabeled data into a predetermined number of clusters based on similarities (k).

centroid, dataset, k-means, (13 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Addressing Missing Sources with Adversarial Support-Matching

Kehrenberg, Thomas, Bartlett, Myles, Sharmanska, Viktoriia, Quadrianto, Novi

arXiv.org Machine LearningMar-24-2022

When trained on diverse labeled data, machine learning models have proven themselves to be a powerful tool in all facets of society. However, due to budget limitations, deliberate or non-deliberate censorship, and other problems during data collection and curation, the labeled training set might exhibit a systematic shortage of data for certain groups. We investigate a scenario in which the absence of certain data is linked to the second level of a two-level hierarchy in the data. Inspired by the idea of protected groups from algorithmic fairness, we refer to the partitions carved by this second level as "subgroups"; we refer to combinations of subgroups and classes, or leaves of the hierarchy, as "sources". To characterize the problem, we introduce the concept of classes with incomplete subgroup support. The representational bias in the training set can give rise to spurious correlations between the classes and the subgroups which render standard classification models ungeneralizable to unseen sources. To overcome this bias, we make use of an additional, diverse but unlabeled dataset, called the "deployment set", to learn a representation that is invariant to subgroup. This is done by adversarially matching the support of the training and deployment sets in representation space. In order to learn the desired invariance, it is paramount that the sets of samples observed by the discriminator are balanced by class; this is easily achieved for the training set, but requires using semi-supervised clustering for the deployment set. We demonstrate the effectiveness of our method with experiments on several datasets and variants of the problem.

artificial intelligence, clustering, machine learning, (16 more...)

arXiv.org Machine Learning

2203.13154

Country:

Europe > United Kingdom > England > East Sussex > Brighton (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Spain > Basque Country > Biscay Province > Bilbao (0.04)
Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback