AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Fast Data Driven Estimation of Cluster Number in Multiplex Images using Embedded Density Outliers

Thomas, Spencer A.

arXiv.org Artificial Intelligence, Jul-21-2022

The usage of chemical imaging technologies is becoming a routine accompaniment to traditional methods in pathology. Significant technological advances have developed these next generation techniques to provide rich, spatially resolved, multidimensional chemical images. The rise of digital pathology has significantly enhanced the synergy of these imaging modalities with optical microscopy and immunohistochemistry, enhancing our understanding of the biological mechanisms and progression of diseases. Techniques such as imaging mass cytometry provide labelled multidimensional (multiplex) images of specific components used in conjunction with digital pathology techniques. These powerful techniques generate a wealth of high dimensional data that create significant challenges in data analysis. Unsupervised methods such as clustering are an attractive way to analyse these data; however, they require the selection of parameters such as the number of clusters. Here we propose a methodology to estimate the number of clusters in an automatic data-driven manner using a deep sparse autoencoder to embed the data into a lower dimensional space. We compute the density of regions in the embedded space, the majority of which are empty, enabling the high density regions to be detected as outliers and providing an estimate for the number of clusters. This framework provides a fully unsupervised and data-driven method to analyse multidimensional data. In this work we demonstrate our method using 45 multiplex imaging mass cytometry datasets. Moreover, our model is trained using only one of the datasets and the learned embedding is applied to the remaining 44 images, providing an efficient process for data analysis. Finally, we demonstrate the high computational efficiency of our method, which is two orders of magnitude faster than estimating via computing the sum of squared distances as a function of cluster number.
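For context, the baseline that the abstract compares against (the sum of squared distances as a function of cluster number) can be sketched as below. This is a minimal illustration that assumes scikit-learn's `KMeans` and synthetic data; the helper name `sse_curve` and the toy blobs are hypothetical, and the paper's autoencoder-based estimator is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def sse_curve(X, max_k, seed=0):
    """Return the k-means sum of squared distances (inertia) for k = 1..max_k."""
    sse = []
    for k in range(1, max_k + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        sse.append(model.inertia_)
    return sse

# Synthetic data (assumption): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

sse = sse_curve(X, max_k=6)
for k, value in enumerate(sse, start=1):
    print(f"k={k}: SSE={value:.1f}")
```

The curve typically drops sharply until k reaches the true number of groups and flattens afterwards; each candidate k needs a full clustering run, which is the cost the abstract's method aims to avoid.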

dataset, feature map, imc dataset, (13 more...)

doi: 10.1109/CIBCB55180.2022.9863014

2207.10469

Country:

Europe > United Kingdom > Scotland (0.04)
Europe > United Kingdom > England > Surrey (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Diagnostic Medicine (0.74)
Health & Medicine > Pharmaceuticals & Biotechnology (0.69)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)
Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

5 Minutes Cheat Sheet Explaining all Machine Learning Models

#artificialintelligence, Jul-20-2022, 08:52:14 GMT

You may have an interview in a few days and a jam-packed schedule to prepare for it. Or maybe you are in revision mode and want to look at all the basic popular machine learning models. If that is the case, you have come to the right place. In this blog, I will briefly explain some of the machine learning models most commonly asked about in interviews. I will also list important parameters related to each model and a source for a detailed explanation of the same topic, so you can dig deeper if and when required.

classification, decision tree, important parameter concept, (14 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.31)

LSCALE: Latent Space Clustering-Based Active Learning for Node Classification

Liu, Juncheng, Wang, Yiwei, Hooi, Bryan, Yang, Renchi, Xiao, Xiaokui

arXiv.org Artificial Intelligence, Jul-20-2022

Node classification on graphs is an important task in many practical domains. It usually requires labels for training, which can be difficult or expensive to obtain in practice. Given a budget for labelling, active learning aims to improve performance by carefully choosing which nodes to label. Previous graph active learning methods learn representations using labelled nodes and select some unlabelled nodes for label acquisition. However, they do not fully utilize the representation power present in unlabelled nodes. We argue that the representation power in unlabelled nodes can be useful for active learning and for further improving performance of active learning for node classification. In this paper, we propose a latent space clustering-based active learning framework for node classification (LSCALE), where we fully utilize the representation power in both labelled and unlabelled nodes. Specifically, to select nodes for labelling, our framework uses the K-Medoids clustering algorithm on a latent space based on a dynamic combination of both unsupervised features and supervised features. In addition, we design an incremental clustering module to avoid redundancy between nodes selected at different steps. Extensive experiments on five datasets show that our proposed framework LSCALE consistently and significantly outperforms the state-of-the-art approaches by a large margin.

active learning, learning, node, (11 more...)

2012.07065

Country: Asia > Singapore > Central Region > Singapore (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Cancer Subtyping by Improved Transcriptomic Features Using Vector Quantized Variational Autoencoder

Chen, Zheng, Yang, Ziwei, Zhu, Lingwei, Shi, Guang, Yue, Kun, Matsubara, Takashi, Kanaya, Shigehiko, Altaf-Ul-Amin, MD

arXiv.org Artificial Intelligence, Jul-20-2022

Defining and separating cancer subtypes is essential for facilitating personalized therapy modality and prognosis of patients. The definition of subtypes has been constantly recalibrated as a result of our deepened understanding. During this recalibration, researchers often rely on clustering of cancer data to provide an intuitive visual reference that could reveal the intrinsic characteristics of subtypes. The data being clustered are often omics data such as transcriptomics that have strong correlations to the underlying biological mechanism. However, while existing studies have shown promising results, they suffer from issues associated with omics data: sample scarcity and high dimensionality. As such, existing methods often impose unrealistic assumptions to extract useful features from the data while avoiding overfitting to spurious correlations. In this paper, we propose to leverage a recent strong generative model, Vector Quantized Variational AutoEncoder (VQ-VAE), to tackle the data issues and extract informative latent features that are crucial to the quality of subsequent clustering by retaining only information relevant to reconstructing the input. VQ-VAE does not impose strict assumptions and hence its latent features are better representations of the input, capable of yielding superior clustering performance with any mainstream clustering method. Extensive experiments and medical analysis on multiple datasets comprising 10 distinct cancers demonstrate the VQ-VAE clustering results can significantly and robustly improve prognosis over prevalent subtyping systems.

cancer, latent feature, subtype, (13 more...)

2207.09783

Country:

Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
North America > United States > New York (0.04)
North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
Asia > China (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Over-the-Air Federated Edge Learning with Hierarchical Clustering

Aygün, Ozan, Kazemi, Mohammad, Gündüz, Deniz, Duman, Tolga M.

arXiv.org Artificial Intelligence, Jul-19-2022

We examine federated learning (FL) with over-the-air (OTA) aggregation, where mobile users (MUs) aim to reach a consensus on a global model with the help of a parameter server (PS) that aggregates the local gradients. In OTA FL, MUs train their models using local data at every training round and transmit their gradients simultaneously using the same frequency band in an uncoded fashion. Based on the received signal of the superposed gradients, the PS performs a global model update. While OTA FL has a significantly decreased communication cost, it is susceptible to adverse channel effects and noise. Employing multiple antennas at the receiver side can reduce these effects, yet the path-loss is still a limiting factor for users located far away from the PS. To ameliorate this issue, in this paper, we propose a wireless-based hierarchical FL scheme that uses intermediate servers (ISs) to form clusters in the areas where the MUs are more densely located. Our scheme utilizes OTA cluster aggregations for the communication of the MUs with their corresponding IS, and OTA global aggregations from the ISs to the PS. We present a convergence analysis for the proposed algorithm, and show through numerical evaluations of the derived analytical expressions and experimental results that utilizing ISs results in faster convergence and better performance than OTA FL alone while using less transmit power. We also validate the results on the performance using different numbers of cluster iterations with different datasets and data distributions. We conclude that the best choice of cluster aggregations depends on the data distribution among the MUs and the clusters.

aggregation, artificial intelligence, machine learning, (16 more...)

2207.09232

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Telecommunications (0.68)
Information Technology (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

A density peaks clustering algorithm with sparse search and K-d tree

Shan, Yunxiao, Li, Shu, Li, Fuxiang, Cui, Yuxin, Li, Shuai, Zhou, Ming, Li, Xiang

arXiv.org Artificial Intelligence, Jul-19-2022

Density peaks clustering (DPC) has become a rising star among clustering algorithms because of its simplicity and practicality. However, it has one main drawback: it is time-consuming due to its high computational complexity. Herein, a density peaks clustering algorithm with sparse search and K-d tree is developed to solve this problem. Firstly, a sparse distance matrix is calculated using a K-d tree to replace the original full rank distance matrix, so as to accelerate the calculation of local density. Secondly, a sparse search strategy is proposed to accelerate the computation of relative-separation with the intersection between the set of $k$ nearest neighbors and the set consisting of the data points with larger local density for any data point. Furthermore, a second-order difference method for decision values is adopted to determine the cluster centers adaptively. Finally, experiments are carried out on datasets with different distribution characteristics, in comparison with six other state-of-the-art clustering algorithms. It is proved that the algorithm can effectively reduce the computational complexity of the original DPC from $O(n^2K)$ to $O(n(n^{1-1/K}+k))$. Especially for larger datasets, the efficiency gain is more pronounced. Moreover, the clustering accuracy is also improved to a certain extent. Therefore, it can be concluded that the overall performance of the newly proposed algorithm is excellent.
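To make the quantities in this abstract concrete, here is a minimal sketch of the baseline density peaks procedure with a full distance matrix, i.e. the quadratic-cost variant the paper sets out to accelerate. It is not the authors' sparse-search or K-d tree method. The function name `density_peaks`, the cutoff `d_c`, and the fixed `n_centers` are assumptions for illustration (the paper selects centers adaptively), and NumPy is assumed to be available.

```python
import numpy as np

def density_peaks(X, d_c, n_centers):
    """Baseline density peaks clustering using a full O(n^2) distance matrix."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Full pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density: number of other points closer than the cutoff d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    # Visit points from densest to sparsest (ties broken by index).
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)                # relative separation
    nearest_denser = np.full(n, -1)    # index of nearest denser point
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point gets the largest distance by convention.
            delta[i] = dist[i].max()
            continue
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]
        nearest_denser[i] = j
    # Decision value: centers combine high density with high separation.
    gamma = rho * delta
    centers = order[np.argsort(-gamma[order], kind="stable")[:n_centers]]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    # Remaining points inherit the label of their nearest denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels, centers

# Toy data (assumption): two small groups, each a unit square plus its midpoint.
square = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5]], dtype=float)
X = np.vstack([square, square + 10.0])
labels, centers = density_peaks(X, d_c=1.0, n_centers=2)
print(labels, centers)
```

Both the distance matrix and the search for the nearest denser point touch every pair of points, which is where a sparse neighbour structure would save work.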

algorithm, cluster center, dataset, (12 more...)

doi: 10.1109/ACCESS.2022.3190958

2203.00973

Country:

North America > United States > California > Orange County > Irvine (0.14)
Asia > China > Heilongjiang Province > Harbin (0.05)
Asia > China > Tianjin Province > Tianjin (0.04)
(4 more...)

Genre: Research Report (0.40)

Industry:

Information Technology (0.46)
Education (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

An Overview of the scikit-learn Clustering Package

#artificialintelligence, Jul-18-2022, 13:05:15 GMT

Clustering is an unsupervised Machine Learning technique, where there is neither a training set nor predefined classes. Clustering is used when there are many records, which should be grouped according to similarity criteria, such as distance. A clustering algorithm takes a dataset as input and returns a list of labels as output, corresponding to the associated clusters. Cluster analysis is an iterative process where, at each step, the current iteration is evaluated and used to feed back into changes to the algorithm in the next iteration, until the desired result is obtained. The scikit-learn library provides a subpackage, called sklearn.cluster, which provides the most common clustering algorithms.
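The "dataset in, list of labels out" pattern described here can be sketched with one estimator from sklearn.cluster. This is a minimal example on a made-up six-point dataset; the choice of `KMeans` and of two clusters is an assumption for illustration, not something the post prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy dataset (assumption): two groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])

# Dataset in, one cluster label per record out.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)
print(labels)
```

The label values themselves are arbitrary identifiers; only which records share a label is meaningful.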

algorithm, artificial intelligence, machine learning, (16 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.92)

Identifying public values and spatial conflicts in urban planning

Herzog, Rico H., Gonçalves, Juliana E., Slingerland, Geertje, Kleinhans, Reinout, Prang, Holger, Brazier, Frances, Verma, Trivik

arXiv.org Artificial Intelligence, Jul-18-2022

Identifying the diverse and often competing values of citizens, and resolving the consequent public value conflicts, are of significant importance for inclusive and integrated urban development. Scholars have highlighted that relational, value-laden urban space gives rise to many diverse conflicts that vary both spatially and temporally. Although notions of public value conflicts have been conceived in theory, there are very few empirical studies that identify such values and their conflicts in urban space. Building on public value theory and using a case-study mixed-methods approach, this paper proposes a new approach to empirically investigate public value conflicts in urban space. Using unstructured participatory data of 4,528 citizen contributions from a Public Participation Geographic Information System in Hamburg, Germany, natural language processing and spatial clustering techniques are used to identify areas of potential value conflicts. Four expert workshops assess and interpret these quantitative findings. Integrating both quantitative and qualitative results, 19 general public values and a total of 9 archetypical conflicts are identified. On the basis of these results, this paper proposes a new conceptual tool of Public Value Spheres that extends the theoretical notion of public-value conflicts and helps to further account for the value-laden nature of urban space.

data mining, machine learning, natural language, (18 more...)

doi: 10.1177/00420980231207475

2207.04719

Country:

Europe > Germany > Hamburg (0.25)
Europe > Germany > Saxony > Leipzig (0.05)
Europe > Netherlands > South Holland > Delft (0.04)
(7 more...)

Genre: Research Report (0.50)

Industry:

Law (1.00)
Government (0.68)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Discovering Behavioral Predispositions in Data to Improve Human Activity Recognition

Popko, Maximilian, Bader, Sebastian, Lüdtke, Stefan, Kirste, Thomas

arXiv.org Artificial Intelligence, Jul-18-2022

The automatic, sensor-based assessment of challenging behavior of persons with dementia is an important task to support the selection of interventions. However, predicting behaviors like apathy and agitation is challenging due to the large inter- and intra-patient variability. The goal of this paper is to improve the recognition performance by making use of the observation that patients tend to show specific behaviors at certain times of the day or week. We propose to identify such segments of similar behavior by clustering the distributions of annotations of the time segments. All time segments within a cluster then consist of similar behaviors and thus indicate a behavioral predisposition (BPD). We utilize BPDs by training a classifier for each BPD. Empirically, we demonstrate that when the BPD per time segment is known, activity recognition performance can be substantially improved.

annotation, classifier, segment length, (12 more...)

2207.08816

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > Middle East > Malta > Port Region > Southern Harbour District > Valletta (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology > Dementia (0.92)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Understanding the Generalization Performance of Spectral Clustering Algorithms

Li, Shaojie, Ouyang, Sheng, Liu, Yong

arXiv.org Artificial Intelligence, Jul-17-2022

The theoretical analysis of spectral clustering mainly focuses on consistency, while there is relatively little research on its generalization performance. In this paper, we study the excess risk bounds of the popular spectral clustering algorithms: relaxed RatioCut and relaxed NCut. Firstly, we show that their excess risk bounds between the empirical continuous optimal solution and the population-level continuous optimal solution have an $\mathcal{O}(1/\sqrt{n})$ convergence rate, where $n$ is the sample size. Secondly, we show the fundamental quantity influencing the excess risk between the empirical discrete optimal solution and the population-level discrete optimal solution. At the empirical level, algorithms can be designed to reduce this quantity. Based on our theoretical analysis, we propose two novel algorithms that can not only penalize this quantity, but also cluster the out-of-sample data without re-eigendecomposition on the overall sample. Experiments verify the effectiveness of the proposed algorithms.
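As background for the "continuous" versus "discrete" solutions mentioned above, the standard two-cluster relaxed RatioCut can be sketched as follows: the continuous solution is the eigenvector for the second-smallest eigenvalue of the unnormalized graph Laplacian, and thresholding it gives a discrete partition. This is a textbook-style illustration assuming NumPy, not either of the paper's proposed algorithms; the function name and the toy graph are hypothetical.

```python
import numpy as np

def relaxed_ratiocut_bipartition(W):
    """Split a weighted graph in two via the relaxed RatioCut solution."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))
    L = D - W  # unnormalized graph Laplacian
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(L)
    # Continuous solution: eigenvector of the second-smallest eigenvalue.
    fiedler = eigvecs[:, 1]
    # Discretization step: threshold the continuous solution at zero.
    return (fiedler > 0).astype(int)

# Toy graph (assumption): two triangles joined by one weak edge.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[a, b] = W[b, a] = 1.0
W[2, 3] = W[3, 2] = 0.01

labels = relaxed_ratiocut_bipartition(W)
print(labels)
```

Which triangle receives label 0 or 1 depends on the sign of the eigenvector, which is arbitrary; only the grouping is meaningful.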

algorithm, optimal solution, spectral, (16 more...)

2205.00281

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.85)
