Clustering
Robust Unsupervised Multi-task and Transfer Learning on Gaussian Mixture Models
Tian, Ye, Weng, Haolei, Feng, Yang
Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to obtain improved learning performance compared to single-task learning. We propose a multi-task GMM learning procedure based on the EM algorithm that not only can effectively utilize unknown similarity between related tasks but is also robust against a fraction of outlier tasks from arbitrary distributions. The proposed procedure is shown to achieve minimax optimal rate of convergence for both parameter estimation error and the excess mis-clustering error, in a wide range of regimes. Moreover, we generalize our approach to tackle the problem of transfer learning for GMMs, where similar theoretical results are derived. Finally, we demonstrate the effectiveness of our methods through simulations and real data examples. To the best of our knowledge, this is the first work studying multi-task and transfer learning on GMMs with theoretical guarantees.
LEFL: Low Entropy Client Sampling in Federated Learning
Abebe, Waqwoya, Munoz, Pablo, Jannesari, Ali
Federated learning (FL) is a machine learning paradigm where multiple clients collaborate to optimize a single global model using their private data. The global model is maintained by a central server that orchestrates the FL training process through a series of training rounds. In each round, the server samples clients from a client pool before sending them its latest global model parameters for further optimization. Naive sampling strategies implement random client sampling and fail to factor client data distributions for privacy reasons. Hence we proposes an alternative sampling strategy by performing a one-time clustering of clients based on their model's learned high-level features while respecting data privacy. This enables the server to perform stratified client sampling across clusters in every round. We show datasets of sampled clients selected with this approach yield a low relative entropy with respect to the global data distribution. Consequently, the FL training becomes less noisy and significantly improves the convergence of the global model by as much as 7.4% in some experiments. Furthermore, it also significantly reduces the communication rounds required to achieve a target accuracy.
Efficient High-Quality Clustering for Large Bipartite Graphs
A bipartite graph contains inter-set edges between two disjoint vertex sets, and is widely used to model real-world data, such as user-item purchase records, author-article publications, and biological interactions between drugs and proteins. k-Bipartite Graph Clustering (k-BGC) is to partition the target vertex set in a bipartite graph into k disjoint clusters. The clustering quality is important to the utility of k-BGC in various applications like social network analysis, recommendation systems, text mining, and bioinformatics, to name a few. Existing approaches to k-BGC either output clustering results with compromised quality due to inadequate exploitation of high-order information between vertices, or fail to handle sizable bipartite graphs with billions of edges. Motivated by this, this paper presents two efficient k-BGC solutions, HOPE and HOPE+, which achieve state-of-the-art performance on large-scale bipartite graphs. HOPE obtains high scalability and effectiveness through a new k-BGC problem formulation based on the novel notion of high-order perspective (HOP) vectors and an efficient technique for low-rank approximation of HOP vectors. HOPE+ further elevates the k-BGC performance to another level with a judicious problem transformation and a highly efficient two-stage optimization framework. Two variants, HOPE+ (FNEM) and HOPE+ (SNEM) are designed when either the Frobenius norm or spectral norm is applied in the transformation. Extensive experiments, comparing HOPE and HOPE+ against 13 competitors on 10 real-world datasets, exhibit that our solutions, especially HOPE+, are superior to existing methods in terms of result quality, while being up to orders of magnitude faster. On the largest dataset MAG with 1.1 billion edges, HOPE+ is able to produce clusters with the highest clustering accuracy within 31 minutes, which is unmatched by any existing solution for k-BGC.
A Contrastive Variational Graph Auto-Encoder for Node Clustering
Mrabah, Nairouz, Bouguessa, Mohamed, Ksantini, Riadh
Variational Graph Auto-Encoders (VGAEs) have been widely used to solve the node clustering task. However, the state-of-the-art methods have numerous challenges. First, existing VGAEs do not account for the discrepancy between the inference and generative models after incorporating the clustering inductive bias. Second, current models are prone to degenerate solutions that make the latent codes match the prior independently of the input signal (i.e., Posterior Collapse). Third, existing VGAEs overlook the effect of the noisy clustering assignments (i.e., Feature Randomness) and the impact of the strong trade-off between clustering and reconstruction (i.e., Feature Drift). To address these problems, we formulate a variational lower bound in a contrastive setting. Our lower bound is a tighter approximation of the log-likelihood function than the corresponding Evidence Lower BOund (ELBO). Thanks to a newly identified term, our lower bound can escape Posterior Collapse and has more flexibility to account for the difference between the inference and generative models. Additionally, our solution has two mechanisms to control the trade-off between Feature Randomness and Feature Drift. Extensive experiments show that the proposed method achieves state-of-the-art clustering results on several datasets. We provide strong evidence that this improvement is attributed to four aspects: integrating contrastive learning and alleviating Feature Randomness, Feature Drift, and Posterior Collapse.
An Evaluation of Machine Learning Approaches for Early Diagnosis of Autism Spectrum Disorder
Rasul, Rownak Ara, Saha, Promy, Bala, Diponkor, Karim, S M Rakib Ul, Abdullah, Md. Ibrahim, Saha, Bishwajit
Autistic Spectrum Disorder (ASD) is a neurological disease characterized by difficulties with social interaction, communication, and repetitive activities. While its primary origin lies in genetics, early detection is crucial, and leveraging machine learning offers a promising avenue for a faster and more cost-effective diagnosis. This study employs diverse machine learning methods to identify crucial ASD traits, aiming to enhance and automate the diagnostic process. We study eight state-of-the-art classification models to determine their effectiveness in ASD detection. We evaluate the models using accuracy, precision, recall, specificity, F1-score, area under the curve (AUC), kappa, and log loss metrics to find the best classifier for these binary datasets. Among all the classification models, for the children dataset, the SVM and LR models achieve the highest accuracy of 100% and for the adult dataset, the LR model produces the highest accuracy of 97.14%. Our proposed ANN model provides the highest accuracy of 94.24% for the new combined dataset when hyperparameters are precisely tuned for each model. As almost all classification models achieve high accuracy which utilize true labels, we become interested in delving into five popular clustering algorithms to understand model behavior in scenarios without true labels. We calculate Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and Silhouette Coefficient (SC) metrics to select the best clustering models. Our evaluation finds that spectral clustering outperforms all other benchmarking clustering models in terms of NMI and ARI metrics while demonstrating comparability to the optimal SC achieved by k-means. The implemented code is available at GitHub.
Catch Me if You Can: Effective Honeypot Placement in Dynamic AD Attack Graphs
Ngo, Huy Quang, Guo, Mingyu, Nguyen, Hung
We study a Stackelberg game between an attacker and a defender on large Active Directory (AD) attack graphs where the defender employs a set of honeypots to stop the attacker from reaching high-value targets. Contrary to existing works that focus on small and static attack graphs, AD graphs typically contain hundreds of thousands of nodes and edges and constantly change over time. We consider two types of attackers: a simple attacker who cannot observe honeypots and a competent attacker who can. To jointly solve the game, we propose a mixed-integer programming (MIP) formulation. We observed that the optimal blocking plan for static graphs performs poorly in dynamic graphs. To solve the dynamic graph problem, we re-design the mixed-integer programming formulation by combining m MIP (dyMIP(m)) instances to produce a near-optimal blocking plan. Furthermore, to handle a large number of dynamic graph instances, we use a clustering algorithm to efficiently find the m-most representative graph instances for a constant m (dyMIP(m)). We prove a lower bound on the optimal blocking strategy for dynamic graphs and show that our dyMIP(m) algorithms produce close to optimal results for a range of AD graphs under realistic conditions.
scRNA-seq Data Clustering by Cluster-aware Iterative Contrastive Learning
Jiang, Weikang, Wang, Jinxian, Guan, Jihong, Zhou, Shuigeng
Single-cell RNA sequencing (scRNA-seq) enables researchers to analyze gene expression at single-cell level. One important task in scRNA-seq data analysis is unsupervised clustering, which helps identify distinct cell types, laying down the foundation for other downstream analysis tasks. In this paper, we propose a novel method called Cluster-aware Iterative Contrastive Learning (CICL in short) for scRNA-seq data clustering, which utilizes an iterative representation learning and clustering framework to progressively learn the clustering structure of scRNA-seq data with a cluster-aware contrastive loss. CICL consists of a Transformer encoder, a clustering head, a projection head and a contrastive loss module. First, CICL extracts the feature vectors of the original and augmented data by the Transformer encoder. Then, it computes the clustering centroids by K-means and employs the student t-distribution to assign pseudo-labels to all cells in the clustering head. The projection-head uses a Multi-Layer Perceptron (MLP) to obtain projections of the augmented data. At last, both pseudo-labels and projections are used in the contrastive loss to guide the model training. Such a process goes iteratively so that the clustering result becomes better and better. Extensive experiments on 25 real world scRNA-seq datasets show that CICL outperforms the SOTA methods. Concretely, CICL surpasses the existing methods by from 14% to 280%, and from 5% to 133% on average in terms of performance metrics ARI and NMI respectively.
EnrichEvent: Enriching Social Data with Contextual Information for Emerging Event Extraction
Esfahani, Mohammadali Sefidi, Akbari, Mohammad
Social platforms have emerged as crucial platforms for disseminating information and discussing real-life social events, offering researchers an excellent opportunity to design and implement novel event detection frameworks. However, most existing approaches only exploit keyword burstiness or network structures to detect unspecified events. Thus, they often need help identifying unknown events regarding the challenging nature of events and social data. Social data, e.g., tweets, is characterized by misspellings, incompleteness, word sense ambiguation, irregular language, and variation in aspects of opinions. Moreover, extracting discriminative features and patterns for evolving events by exploiting the limited structural knowledge is almost infeasible. To address these challenges, in this paper, we propose a novel framework, namely EnrichEvent, that leverages the linguistic and contextual representations of streaming social data. In particular, we leverage contextual and linguistic knowledge to detect semantically related tweets and enhance the effectiveness of the event detection approaches. Eventually, our proposed framework produces cluster chains for each event to show the evolving variation of the event through time. We conducted extensive experiments to evaluate our framework, validating its high performance and effectiveness in detecting and distinguishing unspecified social events.
Unsupervised Learning of Phylogenetic Trees via Split-Weight Embedding
Kong, Yibo, Tiley, George P., Solis-Lemus, Claudia
The Tree of Life is a massive graphical structure which represents the evolutionary process from single cell organisms into the immense biodiversity of living species in present time. Estimating the Tree of Life would not only represent the greatest accomplishment in evolutionary biology and systematics, but it would also allow us to fully understand the development and evolution of important biological traits in nature, in particular, those related to resilience to extinction when exposed to environmental threats such as climate change. Therefore, the development of statistical and machine-learning theory to reconstruct the Tree of Life, especially those scalable to big data, are paramount in evolutionary biology, systematics, and conservation efforts against mass extinctions. Graphical structures that represent evolutionary processes are denoted phylogenetic trees. A phylogenetic tree is a binary tree whose internal nodes represent ancestral species that over time differentiate into two separate species giving rise to its two children nodes (see Figure 1 left). The evolutionary process is then depicted by this bifurcating tree from the root (the origin of life) to the external nodes of the tree (also denoted leaves) which represent the living organisms today.
Stochastic mean-shift clustering
It estimates the probability density function of a random variable Fukunaga & Hostetler (1975). The clustering algorithm is applied to a variety of areas, like segmentation images, Tao et al. (2007); Paris & Durand (2007), particularly medical and satellite images Lu et al. (2011); Ai & Xiong (2014); Wu & Luo (2015); Banerjee et al. (2012), videos Wang et al. (2004), and also applied to high dimensional data clustering Saptarshi et al. (2021). An adapted version of mean-shift clustering was applied to short segments speaker clustering Salmun et al. (2016b,a, 2017); Cohen & Lapidot (2021). This algorithm is deterministic and in an iterative procedure estimates the multi-modal probability density function (pdf) via the "climbing" path of each datum to its mode in a multi-modal distribution. All the data points that reached the same mode are grouped to the same cluster.