AITopics

We introduce the ParClusterers Benchmark Suite (PCBS) -- a collection of highly scalable parallel graph clustering algorithms and benchmarking tools that streamline comparing different graph clustering algorithms and implementations. The benchmark includes clustering algorithms that target a wide range of modern clustering use cases, including community detection, classification, and dense subgraph mining. The benchmark toolkit makes it easy to run and evaluate multiple instances of different clustering algorithms, which can be useful for fine-tuning the performance of clustering on a given task, and for comparing different clustering algorithms based on different metrics of interest, including clustering quality and running time. Using PCBS, we evaluate a broad collection of real-world graph clustering datasets. Somewhat surprisingly, we find that the best quality results are obtained by algorithms that not included in many popular graph clustering toolkits. The PCBS provides a standardized way to evaluate and judge the quality-performance tradeoffs of the active research area of scalable graph clustering algorithms. We believe it will help enable fair, accurate, and nuanced evaluation of graph clustering algorithms in the future.

artificial intelligence, data mining, machine learning, (19 more...)

2411.1029

Country:

Europe > Netherlands > South Holland > Leiden (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
(5 more...)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Halmos, Peter, Liu, Xinhao, Gold, Julian, Raphael, Benjamin J

Low-Rank Optimal Transport through Factor Relaxation with Latent Coupling

arXiv.org Machine LearningNov-15-2024

Optimal transport (OT) is a general framework for finding a minimum-cost transport plan, or coupling, between probability distributions, and has many applications in machine learning. A key challenge in applying OT to massive datasets is the quadratic scaling of the coupling matrix with the size of the dataset. [Forrow et al. 2019] introduced a factored coupling for the k-Wasserstein barycenter problem, which [Scetbon et al. 2021] adapted to solve the primal low-rank OT problem. We derive an alternative parameterization of the low-rank problem based on the $\textit{latent coupling}$ (LC) factorization previously introduced by [Lin et al. 2021] generalizing [Forrow et al. 2019]. The LC factorization has multiple advantages for low-rank OT including decoupling the problem into three OT problems and greater flexibility and interpretability. We leverage these advantages to derive a new algorithm $\textit{Factor Relaxation with Latent Coupling}$ (FRLC), which uses $\textit{coordinate}$ mirror descent to compute the LC factorization. FRLC handles multiple OT objectives (Wasserstein, Gromov-Wasserstein, Fused Gromov-Wasserstein), and marginal constraints (balanced, unbalanced, and semi-relaxed) with linear space complexity. We provide theoretical results on FRLC, and demonstrate superior performance on diverse applications -- including graph clustering and spatial transcriptomics -- while demonstrating its interpretability.

artificial intelligence, diag, machine learning, (19 more...)

arXiv.org Machine Learning

2411.10555

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)
Europe > France > Grand Est > Meurthe-et-Moselle > Nancy (0.04)

Genre: Research Report (0.63)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.46)
Health & Medicine > Therapeutic Area (0.45)
Health & Medicine > Diagnostic Medicine (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Kousovista, Rania, Cosma, Georgina, Abakasanga, Emeka, Akbari, Ashley, Zaccardi, Francesco, Jun, Gyuchan Thomas, Kiani, Reza, Gangadharan, Satheesh

Temporal Patterns of Multiple Long-Term Conditions in Individuals with Intellectual Disability Living in Wales: An Unsupervised Clustering Approach to Disease Trajectories

Identifying and understanding the co-occurrence of multiple long-term conditions (MLTC) in individuals with intellectual disabilities (ID) is vital for effective healthcare management. These individuals often face earlier onset and higher prevalence of MLTCs, yet specific co-occurrence patterns remain unexplored. This study applies an unsupervised approach to characterise MLTC clusters based on shared disease trajectories using electronic health records (EHRs) from 13069 individuals with ID in Wales (2000-2021). Disease associations and temporal directionality were assessed, followed by spectral clustering to group shared trajectories. The population consisted of 52.3% males and 47.7% females, with an average of 4.5 conditions per patient. Males under 45 formed a single cluster dominated by neurological conditions (32.4%), while males above 45 had three clusters, the largest characterised circulatory (51.8%). Females under 45 formed one cluster with digestive conditions (24.6%) as most prevalent, while those aged 45 and older showed two clusters: one dominated by circulatory (34.1%), and the other by digestive (25.9%) and musculoskeletal (21.9%) system conditions. Mental illness, epilepsy, and reflux were common across groups. These clusters offer insights into disease progression in individuals with ID, informing targeted interventions and personalised healthcare strategies.

artificial intelligence, data mining, machine learning, (14 more...)

2411.08894

Country:

Europe > United Kingdom > England > Leicestershire > Loughborough (0.04)
Europe > United Kingdom > England > Leicestershire > Leicester (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Therapeutic Area > Neurology > Epilepsy (0.39)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Braverman, Vladimir, Dharangutte, Prathamesh, Pai, Shreyas, Shah, Vihan, Wang, Chen

Fully Dynamic Adversarially Robust Correlation Clustering in Polylogarithmic Update Time

We study the dynamic correlation clustering problem with $\textit{adaptive}$ edge label flips. In correlation clustering, we are given a $n$-vertex complete graph whose edges are labeled either $(+)$ or $(-)$, and the goal is to minimize the total number of $(+)$ edges between clusters and the number of $(-)$ edges within clusters. We consider the dynamic setting with adversarial robustness, in which the $\textit{adaptive}$ adversary could flip the label of an edge based on the current output of the algorithm. Our main result is a randomized algorithm that always maintains an $O(1)$-approximation to the optimal correlation clustering with $O(\log^{2}{n})$ amortized update time. Prior to our work, no algorithm with $O(1)$-approximation and $\text{polylog}{(n)}$ update time for the adversarially robust setting was known. We further validate our theoretical results with experiments on synthetic and real-world datasets with competitive empirical performances. Our main technical ingredient is an algorithm that maintains $\textit{sparse-dense decomposition}$ with $\text{polylog}{(n)}$ update time, which could be of independent interest.

algorithm, decomposition, vertex, (16 more...)

2411.09979

Country:

Europe > Austria > Vienna (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.14)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
(15 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

A Dual Adaptive Assignment Approach for Robust Graph-Based Clustering

Xiang, Yang, Fan, Li, Saha, Tulika, Pang, Xiaoying, Pan, Yushan, Zhang, Haiyang, Ji, Chengtao

Graph clustering is an essential aspect of network analysis that involves grouping nodes into separate clusters. Recent developments in deep learning have resulted in advanced deep graph clustering techniques, which have proven effective in many applications. Nonetheless, these methods often encounter difficulties when dealing with the complexities of real-world graphs, particularly in the presence of noisy edges. Additionally, many denoising graph clustering strategies tend to suffer from lower performance compared to their non-denoised counterparts, training instability, and challenges in scaling to large datasets. To tackle these issues, we introduce a new framework called the Dual Adaptive Assignment Approach for Robust Graph-Based Clustering (RDSA). RDSA consists of three key components: (i) a node embedding module that effectively integrates the graph's topological features and node attributes; (ii) a structure-based soft assignment module that improves graph modularity by utilizing an affinity matrix for node assignments; and (iii) a node-based soft assignment module that identifies community landmarks and refines node assignments to enhance the model's robustness. We assess RDSA on various real-world datasets, demonstrating its superior performance relative to existing state-of-the-art methods. Our findings indicate that RDSA provides robust clustering across different graph types, excelling in clustering effectiveness and robustness, including adaptability to noise, stability, and scalability.

dataset, graph, node, (15 more...)

2410.21745

Country:

North America > United States > District of Columbia > Washington (0.05)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre:

Research Report > Promising Solution (0.48)
Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Guruprasad, Pranav, Mokhberian, Negar, Varghese, Nikhil, Khatri, Chandra, Kelkar, Amol

KULCQ: An Unsupervised Keyword-based Utterance Level Clustering Quality Metric

Intent discovery is crucial for both building new conversational agents and improving existing ones. While several approaches have been proposed for intent discovery, most rely on clustering to group similar utterances together. Traditional evaluation of these utterance clusters requires intent labels for each utterance, limiting scalability. Although some clustering quality metrics exist that do not require labeled data, they focus solely on cluster geometry while ignoring the linguistic nuances present in conversational transcripts. In this paper, we introduce Keyword-based Utterance Level Clustering Quality (KULCQ), an unsupervised metric that leverages keyword analysis to evaluate clustering quality. We demonstrate KULCQ's effectiveness by comparing it with existing unsupervised clustering metrics and validate its performance through comprehensive ablation studies. Our results show that KULCQ better captures semantic relationships in conversational data while maintaining consistency with geometric clustering principles.

machine learning, natural language, utterance, (17 more...)

2411.09853

Country:

North America > United States > California (0.14)
Europe > Germany > Saarland > Saarbrücken (0.04)
Europe > Belgium (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Gu, Yuqi, Lyu, Zhongyuan, Wang, Kaizheng

Adaptive Transfer Clustering: A Unified Framework

arXiv.org Machine LearningNov-14-2024

We propose a general transfer learning framework for clustering given a main dataset and an auxiliary one about the same subjects. The two datasets may reflect similar but different latent grouping structures of the subjects. We propose an adaptive transfer clustering (ATC) algorithm that automatically leverages the commonality in the presence of unknown discrepancy, by optimizing an estimated bias-variance decomposition. It applies to a broad class of statistical models including Gaussian mixture models, stochastic block models, and latent class models. A theoretical analysis proves the optimality of ATC under the Gaussian mixture model and explicitly quantifies the benefit of transfer. Extensive simulations and real data experiments confirm our method's effectiveness in various scenarios.

artificial intelligence, machine learning, probability, (17 more...)

arXiv.org Machine Learning

2410.21263

Country:

North America > United States (0.28)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > Singapore (0.04)

Genre: Research Report (0.63)

Industry:

Education (0.92)
Government (0.67)
Health & Medicine > Therapeutic Area > Neurology (0.67)
Health & Medicine > Diagnostic Medicine > Imaging (0.45)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)

Revealing the Evolution of Order in Materials Microstructures Using Multi-Modal Computer Vision

Ter-Petrosyan, Arman, Holden, Michael, Bilbrey, Jenna A., Akers, Sarah, Doty, Christina, Yano, Kayla H., Wang, Le, Paudel, Rajendra, Lang, Eric, Hattar, Khalid, Comes, Ryan B., Du, Yingge, Matthews, Bethany E., Spurgeon, Steven R.

The development of high-performance materials for microelectronics, energy storage, and extreme environments depends on our ability to describe and direct property-defining microstructural order. Our present understanding is typically derived from laborious manual analysis of imaging and spectroscopy data, which is difficult to scale, challenging to reproduce, and lacks the ability to reveal latent associations needed for mechanistic models. Here, we demonstrate a multi-modal machine learning (ML) approach to describe order from electron microscopy analysis of the complex oxide La$_{1-x}$Sr$_x$FeO$_3$. We construct a hybrid pipeline based on fully and semi-supervised classification, allowing us to evaluate both the characteristics of each data modality and the value each modality adds to the ensemble. We observe distinct differences in the performance of uni- and multi-modal models, from which we draw general lessons in describing crystal order using computer vision.

artificial intelligence, evolution, machine learning, (19 more...)

2411.09896

Country:

North America > United States > Tennessee > Knox County > Knoxville (0.14)
North America > United States > Delaware > New Castle County > Newark (0.14)
North America > United States > Colorado > Boulder County > Boulder (0.14)
(7 more...)

Genre: Research Report (0.64)

Industry:

Government > Regional Government > North America Government > United States Government (0.68)
Energy > Renewable (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Üstek, İrem, Arana-Catania, Miguel, Farr, Alexander, Petrunin, Ivan

Deep Autoencoders for Unsupervised Anomaly Detection in Wildfire Prediction

Wildfires pose a significantly increasing hazard to global ecosystems due to the climate crisis. Due to its complex nature, there is an urgent need for innovative approaches to wildfire prediction, such as machine learning. This research took a unique approach, differentiating from classical supervised learning, and addressed the gap in unsupervised wildfire prediction using autoencoders and clustering techniques for anomaly detection. Historical weather and normalised difference vegetation index datasets of Australia for 2005 - 2021 were utilised. Two main unsupervised approaches were analysed. The first used a deep autoencoder to obtain latent features, which were then fed into clustering models, isolation forest, local outlier factor and one-class SVM for anomaly detection. The second approach used a deep autoencoder to reconstruct the input data and use reconstruction errors to identify anomalies. Long Short-Term Memory (LSTM) autoencoders and fully connected (FC) autoencoders were employed in this part, both in an unsupervised way learning only from nominal data. The FC autoencoder outperformed its counterparts, achieving an accuracy of 0.71, an F1-score of 0.74, and an MCC of 0.42. These findings highlight the practicality of this method, as it effectively predicts wildfires in the absence of ground truth, utilising an unsupervised learning technique.

autoencoder, data mining, machine learning, (16 more...)

2411.09844

Country:

North America > Canada (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > San Diego County > San Diego (0.04)
(15 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.68)

Industry:

Health & Medicine (0.93)
Information Technology (0.68)
Government > Regional Government > North America Government > United States Government (0.68)
Energy (0.68)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)

Partial Multi-View Clustering via Meta-Learning and Contrastive Feature Alignment

Chen, BoHao

Partial multi-view clustering (PVC) presents significant challenges practical research problem for data analysis in real-world applications, especially when some views of the data are partially missing. Existing clustering methods struggle to handle incomplete views effectively, leading to suboptimal clustering performance. In this paper, we propose a novel dual optimization framework based on contrastive learning, which aims to maximize the consistency of latent features in incomplete multi-view data and improve clustering performance through deep learning models. By combining a fine-tuned Vision Transformer and k-nearest neighbors (KNN), we fill in missing views and dynamically adjust view weights using self-supervised learning and meta-learning. Experimental results demonstrate that our framework outperforms state-of-the-art clustering models on the BDGP and HW datasets, particularly in handling complex and incomplete multi-view data.

artificial intelligence, machine learning, representation, (19 more...)

2411.09758

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)