AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

TreeCSS: An Efficient Framework for Vertical Federated Learning

Zhang, Qinbo, Yan, Xiao, Ding, Yukai, Xu, Quanqing, Hu, Chuang, Zhou, Xiaokai, Jiang, Jiawei

arXiv.org Artificial IntelligenceAug-3-2024

Vertical federated learning (VFL) considers the case that the features of data samples are partitioned over different participants. VFL consists of two main steps, i.e., identify the common data samples for all participants (alignment) and train model using the aligned data samples (training). However, when there are many participants and data samples, both alignment and training become slow. As such, we propose TreeCSS as an efficient VFL framework that accelerates the two main steps. In particular, for sample alignment, we design an efficient multi-party private set intersection (MPSI) protocol called Tree-MPSI, which adopts a tree-based structure and a data-volume-aware scheduling strategy to parallelize alignment among the participants. As model training time scales with the number of data samples, we conduct coreset selection (CSS) to choose some representative data samples for training. Our CCS method adopts a clustering-based scheme for security and generality, which first clusters the features locally on each participant and then merges the local clustering results to select representative samples. In addition, we weight the samples according to their distances to the centroids to reflect their importance to model training. We evaluate the effectiveness and efficiency of our TreeCSS framework on various datasets and models. The results show that compared with vanilla VFL, TreeCSS accelerates training by up to 2.93x and achieves comparable model accuracy.

data sample, dataset, participant, (14 more...)

arXiv.org Artificial Intelligence

2408.01691

Country: Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Transforming Slot Schema Induction with Generative Dialogue State Inference

Finch, James D., Zhao, Boxin, Choi, Jinho D.

arXiv.org Artificial IntelligenceAug-2-2024

The challenge of defining a slot schema to represent the state of a task-oriented dialogue system is addressed by Slot Schema Induction (SSI), which aims to automatically induce slots from unlabeled dialogue data. Whereas previous approaches induce slots by clustering value spans extracted directly from the dialogue text, we demonstrate the power of discovering slots using a generative approach. By training a model to generate slot names and values that summarize key dialogue information with no prior task knowledge, our SSI method discovers high-quality candidate information for representing dialogue state. These discovered slot-value candidates can be easily clustered into unified slot schemas that align well with human-authored schemas. Experimental comparisons on the MultiWOZ and SGD datasets demonstrate that Generative Dialogue State Inference (GenDSI) outperforms the previous state-of-the-art on multiple aspects of the SSI task.

evaluation, gold slot, information, (11 more...)

arXiv.org Artificial Intelligence

2408.01638

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

A Dirichlet stochastic block model for composition-weighted networks

Promskaia, Iuliia, O'Hagan, Adrian, Fop, Michael

arXiv.org Machine LearningAug-1-2024

Network data are observed in various applications where the individual entities of the system interact with or are connected to each other, and often these interactions are defined by their associated strength or importance. Clustering is a common task in network analysis that involves finding groups of nodes displaying similarities in the way they interact with the rest of the network. However, most clustering methods use the strengths of connections between entities in their original form, ignoring the possible differences in the capacities of individual nodes to send or receive edges. This often leads to clustering solutions that are heavily influenced by the nodes' capacities. One way to overcome this is to analyse the strengths of connections in relative rather than absolute terms, expressing each edge weight as a proportion of the sending (or receiving) capacity of the respective node. This, however, induces additional modelling constraints that most existing clustering methods are not designed to handle. In this work we propose a stochastic block model for composition-weighted networks based on direct modelling of compositional weight vectors using a Dirichlet mixture, with the parameters determined by the cluster labels of the sender and the receiver nodes. Inference is implemented via an extension of the classification expectation-maximisation algorithm that uses a working independence assumption, expressing the complete data likelihood of each node of the network as a function of fixed cluster labels of the remaining nodes. A model selection criterion is derived to aid the choice of the number of clusters. The model is validated using simulation studies, and showcased on network data from the Erasmus exchange program and a bike sharing network for the city of London.

algorithm, node, stochastic block model, (15 more...)

arXiv.org Machine Learning

2408.00651

Country:

Europe > Germany (0.05)
Europe > France (0.05)
Europe > Spain (0.04)
(30 more...)

Genre: Research Report (0.50)

Industry:

Education (1.00)
Information Technology (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

UnPaSt: unsupervised patient stratification by differentially expressed biclusters in omics data

Hartung, Michael, Maier, Andreas, Delgado-Chaves, Fernando, Burankova, Yuliya, Isaeva, Olga I., Patroni, Fábio Malta de Sá, He, Daniel, Shannon, Casey, Kaufmann, Katharina, Lohmann, Jens, Savchik, Alexey, Hartebrodt, Anne, Chervontseva, Zoe, Firoozbakht, Farzaneh, Probul, Niklas, Zotova, Evgenia, Tsoy, Olga, Blumenthal, David B., Ester, Martin, Laske, Tanja, Baumbach, Jan, Zolotareva, Olga

arXiv.org Artificial IntelligenceJul-31-2024

Most complex diseases, including cancer and non-malignant diseases like asthma, have distinct molecular subtypes that require distinct clinical approaches. However, existing computational patient stratification methods have been benchmarked almost exclusively on cancer omics data and only perform well when mutually exclusive subtypes can be characterized by many biomarkers. Here, we contribute with a massive evaluation attempt, quantitatively exploring the power of 22 unsupervised patient stratification methods using both, simulated and real transcriptome data. From this experience, we developed UnPaSt (https://apps.cosy.bio/unpast/) optimizing unsupervised patient stratification, working even with only a limited number of subtype-predictive biomarkers. We evaluated all 23 methods on real-world breast cancer and asthma transcriptomics data. Although many methods reliably detected major breast cancer subtypes, only few identified Th2-high asthma, and UnPaSt significantly outperformed its closest competitors in both test datasets. Essentially, we showed that UnPaSt can detect many biologically insightful and reproducible patterns in omic datasets.

bicluster, dataset, subtype, (15 more...)

arXiv.org Artificial Intelligence

2408.002

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(10 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (0.57)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Biomedical Informatics > Translational Bioinformatics (0.68)

Add feedback

Areas of Improvement for Autonomous Vehicles: A Machine Learning Analysis of Disengagement Reports

Ward, Tyler

arXiv.org Artificial IntelligenceJul-31-2024

Since 2014, the California Department of Motor Vehicles (CDMV) has compiled information from manufacturers of autonomous vehicles (AVs) regarding factors that lead to the disengagement from autonomous driving mode in these vehicles. These disengagement reports (DRs) contain information detailing whether the AV disengaged from autonomous mode due to technology failure, manual override, or other factors during driving tests. This paper presents a machine learning (ML) based analysis of the information from the 2023 DRs. We use a natural language processing (NLP) approach to extract important information from the description of a disengagement, and use the k-Means clustering algorithm to group report entries together. The cluster frequency is then analyzed, and each cluster is manually categorized based on the factors leading to disengagement. We discuss findings from previous years' DRs, and provide our own analysis to identify areas of improvement for AVs.

disengagement, information, vehicle, (14 more...)

arXiv.org Artificial Intelligence

2408.00051

Country:

North America > United States > California (0.35)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > Middle East > Malta > Port Region > Southern Harbour District > Valletta (0.04)

Genre: Research Report > New Finding (0.93)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)
Government (0.89)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Temporal Subspace Clustering for Molecular Dynamics Data

Beer, Anna, Heinrigs, Martin, Plant, Claudia, Assent, Ira

arXiv.org Artificial IntelligenceJul-31-2024

We introduce MOSCITO (MOlecular Dynamics Subspace Clustering with Temporal Observance), a subspace clustering for molecular dynamics data. MOSCITO groups those timesteps of a molecular dynamics trajectory together into clusters in which the molecule has similar conformations. In contrast to state-of-the-art methods, MOSCITO takes advantage of sequential relationships found in time series data. Unlike existing work, MOSCITO does not need a two-step procedure with tedious post-processing, but directly models essential properties of the data. Interpreting clusters as Markov states allows us to evaluate the clustering performance based on the resulting Markov state models. In experiments on 60 trajectories and 4 different proteins, we show that the performance of MOSCITO achieves state-of-the-art performance in a novel single-step method. Moreover, by modeling temporal aspects, MOSCITO obtains better segmentation of trajectories, especially for small numbers of clusters.

moscito, subspace, trajectory, (14 more...)

arXiv.org Artificial Intelligence

2408.00056

Country:

Europe > Austria > Vienna (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > District of Columbia > Washington (0.04)
(2 more...)

Genre: Research Report > Promising Solution (0.34)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Mathematics of Computing (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Adaptive Self-supervised Robust Clustering for Unstructured Data with Unknown Cluster Number

Ding, Chen-Lu, Wu, Jiancan, Lin, Wei, Shen, Shiyang, Wang, Xiang, Yuan, Yancheng

arXiv.org Artificial IntelligenceJul-30-2024

We introduce a novel self-supervised deep clustering approach tailored for unstructured data without requiring prior knowledge of the number of clusters, termed Adaptive Self-supervised Robust Clustering (ASRC). In particular, ASRC adaptively learns the graph structure and edge weights to capture both local and global structural information. The obtained graph enables us to learn clustering-friendly feature representations by an enhanced graph auto-encoder with contrastive learning technique. It further leverages the clustering results adaptively obtained by robust continuous clustering (RCC) to generate prototypes for negative sampling, which can further contribute to promoting consistency among positive pairs and enlarging the gap between positive and negative samples. ASRC obtains the final clustering results by applying RCC to the learned feature representations with their consistent graph structure and edge weights. Extensive experiments conducted on seven benchmark datasets demonstrate the efficacy of ASRC, demonstrating its superior performance over other popular clustering models. Notably, ASRC even outperforms methods that rely on prior knowledge of the number of clusters, highlighting its effectiveness in addressing the challenges of clustering unstructured data.

convex, graph structure, representation, (15 more...)

arXiv.org Artificial Intelligence

2407.20119

Country:

North America > United States (0.28)
Asia > China > Jiangsu Province > Yancheng (0.05)
Asia > China > Hong Kong (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Understanding Public Safety Trends in Calgary through data mining

Dewis, Zack, Sen, Apratim, Wong, Jeffrey, Zhang, Yujia

arXiv.org Artificial IntelligenceJul-30-2024

This paper utilizes statistical data from various open datasets in Calgary to to uncover patterns and insights for community crimes, disorders, and traffic incidents. Community attributes like demographics, housing, and pet registration were collected and analyzed through geospatial visualization and correlation analysis. Strongly correlated features were identified using the chi-square test, and predictive models were built using association rule mining and machine learning algorithms. The findings suggest that crime rates are closely linked to factors such as population density, while pet registration has a smaller impact. This study offers valuable insights for city managers to enhance community safety strategies.

calgary, dataset, incident, (16 more...)

arXiv.org Artificial Intelligence

2407.21163

Country: North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.77)

Genre:

Research Report > New Finding (0.88)
Research Report > Experimental Study (0.68)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Education (0.94)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Automated Quantification of Hyperreflective Foci in SD-OCT With Diabetic Retinopathy

Okuwobi, Idowu Paul, Ji, Zexuan, Fan, Wen, Yuan, Songtao, Bekalo, Loza, Chen, Qiang

arXiv.org Artificial IntelligenceJul-30-2024

The presence of hyperreflective foci (HFs) is related to retinal disease progression, and the quantity has proven to be a prognostic factor of visual and anatomical outcome in various retinal diseases. However, lack of efficient quantitative tools for evaluating the HFs has deprived ophthalmologist of assessing the volume of HFs. For this reason, we propose an automated quantification algorithm to segment and quantify HFs in spectral domain optical coherence tomography (SD-OCT). The proposed algorithm consists of two parallel processes namely: region of interest (ROI) generation and HFs estimation. To generate the ROI, we use morphological reconstruction to obtain the reconstructed image and histogram constructed for data distributions and clustering. In parallel, we estimate the HFs by extracting the extremal regions from the connected regions obtained from a component tree. Finally, both the ROI and the HFs estimation process are merged to obtain the segmented HFs. The proposed algorithm was tested on 40 3D SD-OCT volumes from 40 patients diagnosed with non-proliferative diabetic retinopathy (NPDR), proliferative diabetic retinopathy (PDR), and diabetic macular edema (DME). The average dice similarity coefficient (DSC) and correlation coefficient (r) are 69.70%, 0.99 for NPDR, 70.31%, 0.99 for PDR, and 71.30%, 0.99 for DME, respectively. The proposed algorithm can provide ophthalmologist with good HFs quantitative information, such as volume, size, and location of the HFs.

algorithm, ground truth, pixel, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/JBHI.2019.2929842

2407.21272

Country:

Asia > China > Jiangsu Province > Nanjing (0.05)
North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.68)

Industry:

Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

SANGRIA: Surgical Video Scene Graph Optimization for Surgical Workflow Prediction

Köksal, Çağhan, Ghazaei, Ghazal, Holm, Felix, Farshad, Azade, Navab, Nassir

arXiv.org Artificial IntelligenceJul-29-2024

Graph-based holistic scene representations facilitate surgical workflow understanding and have recently demonstrated significant success. However, this task is often hindered by the limited availability of densely annotated surgical scene data. In this work, we introduce an end-to-end framework for the generation and optimization of surgical scene graphs on a downstream task. Our approach leverages the flexibility of graph-based spectral clustering and the generalization capability of foundation models to generate unsupervised scene graphs with learnable properties. We reinforce the initial spatial graph with sparse temporal connections using local matches between consecutive frames to predict temporally consistent clusters across a temporal neighborhood. By jointly optimizing the spatiotemporal relations and node features of the dynamic scene graph with the downstream task of phase segmentation, we address the costly and annotation-burdensome task of semantic scene comprehension and scene graph generation in surgical videos using only weak surgical phase labels. Further, by incorporating effective intermediate scene representation disentanglement steps within the pipeline, our solution outperforms the SOTA on the CATARACTS dataset by 8% accuracy and 10% F1 score in surgical workflow recognition

graph, international conference, segmentation, (16 more...)

arXiv.org Artificial Intelligence

2407.20214

Country:

South America > Peru > Lima Department > Lima Province > Lima (0.04)
Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre:

Workflow (0.82)
Research Report (0.64)

Industry:

Health & Medicine > Surgery (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback