AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

A Joint Gradient and Loss Based Clustered Federated Learning Design

Lin, Licheng, Chen, Mingzhe, Yang, Zhaohui, Wu, Yusen, Liu, Yuchen

arXiv.org Artificial IntelligenceNov-22-2023

In this paper, a novel clustered FL framework that enables distributed edge devices with non-IID data to independently form several clusters in a distributed manner and implement FL training within each cluster is proposed. In particular, our designed clustered FL algorithm must overcome two challenges associated with FL training. First, the server has limited FL training information (i.e., the parameter server can only obtain the FL model information of each device) and limited computational power for finding the differences among a large amount of devices. Second, each device does not have the data information of other devices for device clustering and can only use global FL model parameters received from the server and its data information to determine its cluster identity, which will increase the difficulty of device clustering. To overcome these two challenges, we propose a joint gradient and loss based distributed clustering method in which each device determines its cluster identity considering the gradient similarity and training loss. The proposed clustering method not only considers how a local FL model of one device contributes to each cluster but also the direction of gradient descent thus improving clustering speed. By delegating clustering decisions to edge devices, each device can fully leverage its private data information to determine its own cluster identity, thereby reducing clustering overhead and improving overall clustering performance. Simulation results demonstrate that our proposed clustered FL algorithm can reduce clustering iterations by up to 99% compared to the existing baseline.

algorithm, cluster identity, information, (16 more...)

arXiv.org Artificial Intelligence

2311.13665

Country:

North America > United States > Florida > Miami-Dade County > Coral Gables (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
North America > United States > North Carolina > Wake County > Raleigh (0.04)
(6 more...)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

Comprehensive Evaluation of GNN Training Systems: A Data Management Perspective

Yuan, Hao, Liu, Yajiong, Zhang, Yanfeng, Ai, Xin, Wang, Qiange, Chen, Chaoyi, Gu, Yu, Yu, Ge

arXiv.org Artificial IntelligenceNov-22-2023

Many Graph Neural Network (GNN) training systems have emerged recently to support efficient GNN training. Since GNNs embody complex data dependencies between training samples, the training of GNNs should address distinct challenges different from DNN training in data management, such as data partitioning, batch preparation for mini-batch training, and data transferring between CPUs and GPUs. These factors, which take up a large proportion of training time, make data management in GNN training more significant. This paper reviews GNN training from a data management perspective and provides a comprehensive analysis and evaluation of the representative approaches. We conduct extensive experiments on various benchmark datasets and show many interesting and valuable results. We also provide some practical tips learned from these experiments, which are helpful for designing GNN training systems in the future.

gnn training, graph, vertex, (13 more...)

arXiv.org Artificial Intelligence

2311.13279

Country:

North America > Canada > Quebec > Montreal (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(22 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology (0.46)
Energy (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

pSTarC: Pseudo Source Guided Target Clustering for Fully Test-Time Adaptation

Sreenivas, Manogna, Chakrabarty, Goirik, Biswas, Soma

arXiv.org Artificial IntelligenceNov-22-2023

Test Time Adaptation (TTA) is a pivotal concept in machine learning, enabling models to perform well in real-world scenarios, where test data distribution differs from training. In this work, we propose a novel approach called pseudo Source guided Target Clustering (pSTarC) addressing the relatively unexplored area of TTA under real-world domain shifts. This method draws inspiration from target clustering techniques and exploits the source classifier for generating pseudo-source samples. The test samples are strategically aligned with these pseudo-source samples, facilitating their clustering and thereby enhancing TTA performance. pSTarC operates solely within the fully test-time adaptation protocol, removing the need for actual source data. Experimental validation on a variety of domain shift datasets, namely VisDA, Office-Home, DomainNet-126, CIFAR-100C verifies pSTarC's effectiveness. This method exhibits significant improvements in prediction accuracy along with efficient computational requirements. Furthermore, we also demonstrate the universality of the pSTarC framework by showing its effectiveness for the continuous TTA framework. The source code for our method is available at https://manogna-s.github.io/pstarc

adaptation, dataset, pstarc, (15 more...)

arXiv.org Artificial Intelligence

2309.00846

Country: Asia > India > Karnataka > Bengaluru (0.04)

Genre:

Research Report > Promising Solution (0.48)
Overview > Innovation (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Orchard: building large cancer phylogenies using stochastic combinatorial search

Kulman, E., Kuang, R., Morris, Q.

arXiv.org Artificial IntelligenceNov-21-2023

Phylogenies depicting the evolutionary history of genetically heterogeneous subpopulations of cells from the same cancer i.e., cancer phylogenies, provide useful insights about cancer development and inform treatment. Cancer phylogenies can be reconstructed using data obtained from bulk DNA sequencing of multiple tissue samples from the same cancer. We introduce Orchard, a fast algorithm that reconstructs cancer phylogenies using point mutations detected in bulk DNA sequencing data. Orchard constructs cancer phylogenies progressively, one point mutation at a time, ultimately sampling complete phylogenies from a posterior distribution implied by the bulk DNA data. Orchard reconstructs more plausible phylogenies than state-of-the-art cancer phylogeny reconstruction methods on 90 simulated cancers and 14 B-progenitor acute lymphoblastic leukemias (B-ALLs). These results demonstrate that Orchard accurately reconstructs cancer phylogenies with up to 300 mutations. We then introduce a simple graph based clustering algorithm that uses a reconstructed phylogeny to infer unique groups of mutations i.e., mutation clusters, that characterize the genetic differences between cancer cell populations, and show that this approach is competitive with state-of-the-art mutation clustering methods.

algorithm, mutation, orchard, (15 more...)

arXiv.org Artificial Intelligence

2311.12917

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
North America > United States > Massachusetts (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Oncology > Leukemia (0.86)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Fair Polylog-Approximate Low-Cost Hierarchical Clustering

Knittel, Marina, Springer, Max, Dickerson, John, Hajiaghayi, MohammadTaghi

arXiv.org Artificial IntelligenceNov-21-2023

Research in fair machine learning, and particularly clustering, has been crucial in recent years given the many ethical controversies that modern intelligent systems have posed. Ahmadian et al. [2020] established the study of fairness in \textit{hierarchical} clustering, a stronger, more structured variant of its well-known flat counterpart, though their proposed algorithm that optimizes for Dasgupta's [2016] famous cost function was highly theoretical. Knittel et al. [2023] then proposed the first practical fair approximation for cost, however they were unable to break the polynomial-approximate barrier they posed as a hurdle of interest. We break this barrier, proposing the first truly polylogarithmic-approximate low-cost fair hierarchical clustering, thus greatly bridging the gap between the best fair and vanilla hierarchical clustering approximations.

algorithm, fairness, vertex, (16 more...)

arXiv.org Artificial Intelligence

2311.12501

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Maryland > Prince George's County > College Park (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

SpecHD: Hyperdimensional Computing Framework for FPGA-based Mass Spectrometry Clustering

Pinge, Sumukh, Xu, Weihong, Kang, Jaeyoung, Zhang, Tianqi, Moshiri, Neima, Bittremieux, Wout, Rosing, Tajana

arXiv.org Artificial IntelligenceNov-20-2023

Mass spectrometry-based proteomics is a key enabler for personalized healthcare, providing a deep dive into the complex protein compositions of biological systems. This technology has vast applications in biotechnology and biomedicine but faces significant computational bottlenecks. Current methodologies often require multiple hours or even days to process extensive datasets, particularly in the domain of spectral clustering. To tackle these inefficiencies, we introduce SpecHD, a hyperdimensional computing (HDC) framework supplemented by an FPGA-accelerated architecture with integrated near-storage preprocessing. Utilizing streamlined binary operations in an HDC environment, SpecHD capitalizes on the low-latency and parallel capabilities of FPGAs. This approach markedly improves clustering speed and efficiency, serving as a catalyst for real-time, high-throughput data analysis in future healthcare applications. Our evaluations demonstrate that SpecHD not only maintains but often surpasses existing clustering quality metrics while drastically cutting computational time. Specifically, it can cluster a large-scale human proteome dataset-comprising 25 million MS/MS spectra and 131 GB of MS data-in just 5 minutes. With energy efficiency exceeding 31x and a speedup factor that spans a range of 6x to 54x over existing state of-the-art solutions, SpecHD emerges as a promising solution for the rapid analysis of mass spectrometry data with great implications for personalized healthcare.

architecture, efficiency, spectra, (15 more...)

arXiv.org Artificial Intelligence

2311.12874

Country:

Africa > Chad > Salamat (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Belgium > Flanders > Antwerp Province > Antwerp (0.04)

Genre: Research Report > Promising Solution (0.68)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.90)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Hardware (0.95)
Information Technology > Information Management (0.93)
(3 more...)

Add feedback

Establishing Central Sensitization Inventory Cut-off Values in patients with Chronic Low Back Pain by Unsupervised Machine Learning

Zheng, Xiaoping, Lamoth, Claudine JC, Timmerman, Hans, Otten, Ebert, Reneman, Michiel F

arXiv.org Artificial IntelligenceNov-20-2023

Human Assumed Central Sensitization is involved in the development and maintenance of chronic low back pain (CLBP). The Central Sensitization Inventory (CSI) was developed to evaluate the presence of HACS, with a cut-off value of 40/100 based on patients with chronic pain. However, various factors including pain conditions (e.g., CLBP), and gender may influence this cut-off value. For chronic pain condition such as CLBP, unsupervised clustering approaches can take these factors into consideration and automatically learn the HACS-related patterns. Therefore, this study aimed to determine the cut-off values for a Dutch-speaking population with CLBP, considering the total group and stratified by gender based on unsupervised machine learning. In this study, questionnaire data covering pain, physical, and psychological aspects were collected from patients with CLBP and aged-matched pain-free adults (referred to as healthy controls, HC). Four clustering approaches were applied to identify HACS-related clusters based on the questionnaire data and gender. The clustering performance was assessed using internal and external indicators. Subsequently, receiver operating characteristic analysis was conducted on the best clustering results to determine the optimal cut-off values. The study included 151 subjects, consisting of 63 HCs and 88 patients with CLBP. Hierarchical clustering yielded the best results, identifying three clusters: healthy group, CLBP with low HACS level, and CLBP with high HACS level groups. Based on the low HACS levels group (including HC and CLBP with low HACS level) and high HACS level group, the cut-off value for the overall groups were 35, 34 for females, and 35 for. The findings suggest that the optimal cut-off values for CLBP is 35. The gender-related cut-off values should be interpreted with caution due to the unbalanced gender distribution in the sample.

cut-off value, data sample, hac, (12 more...)

arXiv.org Artificial Intelligence

2311.11862

Country:

Europe > Netherlands (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Therapeutic Area > Musculoskeletal (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Add feedback

Online Arbitrary Shaped Clustering through Correlated Gaussian Functions

Eidheim, Ole Christian

arXiv.org Artificial IntelligenceNov-20-2023

There is no convincing evidence that backpropagation is a biologically plausible mechanism, and further studies of alternative learning methods are needed. A novel online clustering algorithm is presented that can produce arbitrary shaped clusters from inputs in an unsupervised manner, and requires no prior knowledge of the number of clusters in the input data. This is achieved by finding correlated outputs from functions that capture commonly occurring input patterns. The algorithm can be deemed more biologically plausible than model optimization through backpropagation, although practical applicability may require additional research. However, the method yields satisfactory results on several toy datasets on a noteworthy range of hyperparameters.

dataset, gaussian function, neuron, (11 more...)

arXiv.org Artificial Intelligence

2302.06335

Country: Europe > Norway > Central Norway > Trøndelag > Trondheim (0.04)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

Data thinning for convolution-closed distributions

Neufeld, Anna, Dharamshi, Ameer, Gao, Lucy L., Witten, Daniela

arXiv.org Machine LearningNov-20-2023

We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable.

artificial intelligence, machine learning, sample splitting, (16 more...)

arXiv.org Machine Learning

2301.07276

Country:

North America > United States (0.46)
North America > Canada > British Columbia (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.81)

Industry:

Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using Clustering and Information Theory Techniques

Gromov, Vasilii, Dang, Quynh Nhu

arXiv.org Artificial IntelligenceNov-19-2023

With the development of generative models like GPT-3, it is increasingly more challenging to differentiate generated texts from human-written ones. There is a large number of studies that have demonstrated good results in bot identification. However, the majority of such works depend on supervised learning methods that require labelled data and/or prior knowledge about the bot-model architecture. In this work, we propose a bot identification algorithm that is based on unsupervised learning techniques and does not depend on a large amount of labelled data. By combining findings in semantic analysis by clustering (crisp and fuzzy) and information techniques, we construct a robust model that detects a generated text for different types of bot. We find that the generated texts tend to be more chaotic while literary works are more complex. We also demonstrate that the clustering of human texts results in fuzzier clusters in comparison to the more compact and well-separated clusters of bot-generated texts.

algorithm, bot, distinguishing human-written, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-031-45170-6_3

2311.11441

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Asia > Russia (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Services (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback