Goto

Collaborating Authors

 Clustering


Federated Learning for Short Text Clustering

arXiv.org Artificial Intelligence

Short text clustering has been popularly studied for its significance in mining valuable insights from many short texts. In this paper, we focus on the federated short text clustering (FSTC) problem, i.e., clustering short texts that are distributed in different clients, which is a realistic problem under privacy requirements. Compared with the centralized short text clustering problem that short texts are stored on a central server, the FSTC problem has not been explored yet. To fill this gap, we propose a Federated Robust Short Text Clustering (FSTC) framework. FSTC includes two main modules, i.e., robust short text clustering module and federated cluster center aggregation module. The robust short text clustering module aims to train an effective short text clustering model with local data in each client. We innovatively combine optimal transport to generate pseudo-labels with Gaussian-uniform mixture model to ensure the reliability of the pseudo-supervised data. The federated cluster center aggregation module aims to exchange knowledge across clients without sharing local raw data in an efficient way. The server aggregates the local cluster centers from different clients and then sends the global centers back to all clients in each communication round. Our empirical studies on three short text clustering datasets demonstrate that FSTC significantly outperforms the federated short text clustering baselines.


Learning Uniform Clusters on Hypersphere for Deep Graph-level Clustering

arXiv.org Artificial Intelligence

Graph clustering has been popularly studied in recent years. However, most existing graph clustering methods focus on node-level clustering, i.e., grouping nodes in a single graph into clusters. In contrast, graph-level clustering, i.e., grouping multiple graphs into clusters, remains largely unexplored. Graph-level clustering is critical in a variety of real-world applications, such as, properties prediction of molecules and community analysis in social networks. However, graph-level clustering is challenging due to the insufficient discriminability of graph-level representations, and the insufficient discriminability makes deep clustering be more likely to obtain degenerate solutions (cluster collapse). To address the issue, we propose a novel deep graph-level clustering method called Uniform Deep Graph Clustering (UDGC). UDGC assigns instances evenly to different clusters and then scatters those clusters on unit hypersphere, leading to a more uniform cluster-level distribution and a slighter cluster collapse. Specifically, we first propose Augmentation-Consensus Optimal Transport (ACOT) for generating uniformly distributed and reliable pseudo labels for partitioning clusters. Then we adopt contrastive learning to scatter those clusters. Besides, we propose Center Alignment Optimal Transport (CAOT) for guiding the model to learn better parameters, which further promotes the cluster performance. Our empirical study on eight well-known datasets demonstrates that UDGC significantly outperforms the state-of-the-art models.


Strategies for Parallelizing the Big-Means Algorithm: A Comprehensive Tutorial for Effective Big Data Clustering

arXiv.org Artificial Intelligence

This study focuses on the optimization of the Big-means algorithm for clustering large-scale datasets, exploring four distinct parallelization strategies. We conducted extensive experiments to assess the computational efficiency, scalability, and clustering performance of each approach, revealing their benefits and limitations. The paper also delves into the trade-offs between computational efficiency and clustering quality, examining the impacts of various factors. Our insights provide practical guidance on selecting the best parallelization strategy based on available resources and dataset characteristics, contributing to a deeper understanding of parallelization techniques for the Big-means algorithm.


Exploring the impact of social stress on the adaptive dynamics of COVID-19: Typing the behavior of na\"ive populations faced with epidemics

arXiv.org Artificial Intelligence

In the context of natural disasters, human responses inevitably intertwine with natural factors. The COVID-19 pandemic, as a significant stress factor, has brought to light profound variations among different countries in terms of their adaptive dynamics in addressing the spread of infection outbreaks across different regions. This emphasizes the crucial role of cultural characteristics in natural disaster analysis. The theoretical understanding of large-scale epidemics primarily relies on mean-field kinetic models. However, conventional SIR-like models failed to fully explain the observed phenomena at the onset of the COVID-19 outbreak. These phenomena included the unexpected cessation of exponential growth, the reaching of plateaus, and the occurrence of multi-wave dynamics. In situations where an outbreak of a highly virulent and unfamiliar infection arises, it becomes crucial to respond swiftly at a non-medical level to mitigate the negative socio-economic impact. Here we present a theoretical examination of the first wave of the epidemic based on a simple SIRSS model (SIR with Social Stress). We conduct an analysis of the socio-cultural features of na\"ive population behaviors across various countries worldwide. The unique characteristics of each country/territory are encapsulated in only a few constants within our model, derived from the fitted COVID-19 statistics. These constants also reflect the societal response dynamics to the external stress factor, underscoring the importance of studying the mutual behavior of humanity and natural factors during global social disasters. Based on these distinctive characteristics of specific regions, local authorities can optimize their strategies to effectively combat epidemics until vaccines are developed.


Unsupervised Learning for Topological Classification of Transportation Networks

arXiv.org Artificial Intelligence

With increasing urbanization, transportation plays an increasingly critical role in city development. The number of studies on modeling, optimization, simulation, and data analysis of transportation systems is on the rise. Many of these studies utilize transportation test networks to represent real-world transportation systems in urban areas, examining the efficacy of their proposed approaches. Each of these networks exhibits unique characteristics in their topology, making their applications distinct for various study objectives. Despite their widespread use in research, there is a lack of comprehensive study addressing the classification of these networks based on their topological characteristics. This study aims to fill this gap by employing unsupervised learning methods, particularly clustering. We present a comprehensive framework for evaluating various topological network characteristics. Additionally, we employ two dimensionality reduction techniques, namely Principal Component Analysis (PCA) and Isometric Feature Mapping (ISOMAP), to reduce overlaps of highly correlated features and enhance the interpretability of the subsequent classification results. We then utilize two clustering algorithms, K-means and HDBSCAN, to classify 14 transportation networks. The PCA method, followed by the K-means clustering approach, outperforms other alternatives with a Silhouette score of $0.510$, enabling the classification of transportation networks into five clusters. We also provide a detailed discussion on the resulting classification.


Cluster trajectory of SOFA score in predicting mortality in sepsis

arXiv.org Artificial Intelligence

Objective: Sepsis is a life-threatening condition. Sequential Organ Failure Assessment (SOFA) score is commonly used to assess organ dysfunction and predict ICU mortality, but it is taken as a static measurement and fails to capture dynamic changes. This study aims to investigate the relationship between dynamic changes in SOFA scores over the first 72 hours of ICU admission and patient outcomes. Design, setting, and participants: 3,253 patients in the Medical Information Mart for Intensive Care IV database who met the sepsis-3 criteria and were admitted from the emergency department with at least 72 hours of ICU admission and full-active resuscitation status were analysed. Group-based trajectory modelling with dynamic time warping and k-means clustering identified distinct trajectory patterns in dynamic SOFA scores. They were subsequently compared using Python. Main outcome measures: Outcomes including hospital and ICU mortality, length of stay in hospital and ICU, and readmission during hospital stay, were collected. Discharge time from ICU to wards and cut-offs at 7-day and 14-day were taken. Results: Four clusters were identified: A (consistently low SOFA scores), B (rapid increase followed by a decline in SOFA scores), C (higher baseline scores with gradual improvement), and D (persistently elevated scores). Cluster D had the longest ICU and hospital stays, highest ICU and hospital mortality. Discharge rates from ICU were similar for Clusters A and B, while Cluster C had initially comparable rates but a slower transition to ward. Conclusion: Monitoring dynamic changes in SOFA score is valuable for assessing sepsis severity and treatment responsiveness.


Privacy-Preserving Algorithmic Recourse

arXiv.org Artificial Intelligence

When individuals are subject to adverse outcomes from machine learning models, providing a recourse path to help achieve a positive outcome is desirable. Recent work has shown that counterfactual explanations - which can be used as a means of single-step recourse - are vulnerable to privacy issues, putting an individuals' privacy at risk. Providing a sequential multi-step path for recourse can amplify this risk. Furthermore, simply adding noise to recourse paths found from existing methods can impact the realism and actionability of the path for an end-user. In this work, we address privacy issues when generating realistic recourse paths based on instance-based counterfactual explanations, and provide PrivRecourse: an end-to-end privacy preserving pipeline that can provide realistic recourse paths. PrivRecourse uses differentially private (DP) clustering to represent non-overlapping subsets of the private dataset. These DP cluster centers are then used to generate recourse paths by forming a graph with cluster centers as the nodes, so that we can generate realistic - feasible and actionable - recourse paths. We empirically evaluate our approach on finance datasets and compare it to simply adding noise to data instances, and to using DP synthetic data, to generate the graph. We observe that PrivRecourse can provide paths that are private and realistic.


How to Use K-means for Big Data Clustering?

arXiv.org Artificial Intelligence

K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of data. Therefore, it is crucial to improve K-means by scaling it to big data using as few of the following computational resources as possible: data, time, and algorithmic ingredients. We propose a new parallel scheme of using K-means and K-means++ algorithms for big data clustering that satisfies the properties of a ``true big data'' algorithm and outperforms the classical and recent state-of-the-art MSSC approaches in terms of solution quality and runtime. The new approach naturally implements global search by decomposing the MSSC problem without using additional metaheuristics. This work shows that data decomposition is the basic approach to solve the big data clustering problem. The empirical success of the new algorithm allowed us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution.


AutoKG: Efficient Automated Knowledge Graph Generation for Language Models

arXiv.org Artificial Intelligence

Traditional methods of linking large language models (LLMs) to knowledge bases via the semantic similarity search often fall short of capturing complex relational dynamics. To address these limitations, we introduce AutoKG, a lightweight and efficient approach for automated knowledge graph (KG) construction. For a given knowledge base consisting of text blocks, AutoKG first extracts keywords using a LLM and then evaluates the relationship weight between each pair of keywords using graph Laplace learning. We employ a hybrid search scheme combining vector similarity and graph-based associations to enrich LLM responses. Preliminary experiments demonstrate that AutoKG offers a more comprehensive and interconnected knowledge retrieval mechanism compared to the semantic similarity search, thereby enhancing the capabilities of LLMs in generating more insightful and relevant outputs.


A Unified Framework for Fair Spectral Clustering With Effective Graph Learning

arXiv.org Artificial Intelligence

We consider the problem of spectral clustering under group fairness constraints, where samples from each sensitive group are approximately proportionally represented in each cluster. Traditional fair spectral clustering (FSC) methods consist of two consecutive stages, i.e., performing fair spectral embedding on a given graph and conducting $k$means to obtain discrete cluster labels. However, in practice, the graph is usually unknown, and we need to construct the underlying graph from potentially noisy data, the quality of which inevitably affects subsequent fair clustering performance. Furthermore, performing FSC through separate steps breaks the connections among these steps, leading to suboptimal results. To this end, we first theoretically analyze the effect of the constructed graph on FSC. Motivated by the analysis, we propose a novel graph construction method with a node-adaptive graph filter to learn graphs from noisy data. Then, all independent stages of conventional FSC are integrated into a single objective function, forming an end-to-end framework that inputs raw data and outputs discrete cluster labels. An algorithm is developed to jointly and alternately update the variables in each stage. Finally, we conduct extensive experiments on synthetic, benchmark, and real data, which show that our model is superior to state-of-the-art fair clustering methods.