Goto

Collaborating Authors

 Clustering


Sequencing Silicates in the IRS Debris Disk Catalog I: Methodology for Unsupervised Clustering

arXiv.org Artificial Intelligence

Debris disks, which consist of dust, planetesimals, planets, and gas, offer a unique window into the mineralogical composition of their parent bodies, especially during the critical phase of terrestrial planet formation spanning 10 to a few hundred million years. Observations from the $\textit{Spitzer}$ Space Telescope have unveiled thousands of debris disks, yet systematic studies remain scarce, let alone those with unsupervised clustering techniques. This study introduces $\texttt{CLUES}$ (CLustering UnsupErvised with Sequencer), a novel, non-parametric, fully-interpretable machine-learning spectral analysis tool designed to analyze and classify the spectral data of debris disks. $\texttt{CLUES}$ combines multiple unsupervised clustering methods with multi-scale distance measures to discern new groupings and trends, offering insights into compositional diversity and geophysical processes within these disks. Our analysis allows us to explore a vast parameter space in debris disk mineralogy and also offers broader applications in fields such as protoplanetary disks and solar system objects. This paper details the methodology, implementation, and initial results of $\texttt{CLUES}$, setting the stage for more detailed follow-up studies focusing on debris disk mineralogy and demographics.


Time-Varying Graph Learning for Data with Heavy-Tailed Distribution

arXiv.org Artificial Intelligence

Graph models provide efficient tools to capture the underlying structure of data defined over networks. Many real-world network topologies are subject to change over time. Learning to model the dynamic interactions between entities in such networks is known as time-varying graph learning. Current methodology for learning such models often lacks robustness to outliers in the data and fails to handle heavy-tailed distributions, a common feature in many real-world datasets (e.g., financial data). This paper addresses the problem of learning time-varying graph models capable of efficiently representing heavy-tailed data. Unlike traditional approaches, we incorporate graph structures with specific spectral properties to enhance data clustering in our model. Our proposed method, which can also deal with noise and missing values in the data, is based on a stochastic approach, where a non-negative vector auto-regressive (VAR) model captures the variations in the graph and a Student-t distribution models the signal originating from this underlying time-varying graph. We propose an iterative method to learn time-varying graph topologies within a semi-online framework where only a mini-batch of data is used to update the graph. Simulations with both synthetic and real datasets demonstrate the efficacy of our model in analyzing heavy-tailed data, particularly those found in financial markets.


Federated Deep Subspace Clustering

arXiv.org Artificial Intelligence

This paper introduces FDSC, a private-protected subspace clustering (SC) approach with federated learning (FC) schema. In each client, there is a deep subspace clustering network accounting for grouping the isolated data, composed of a encode network, a self-expressive layer, and a decode network. FDSC is achieved by uploading the encode network to communicate with other clients in the server. Besides, FDSC is also enhanced by preserving the local neighborhood relationship in each client. With the effects of federated learning and locality preservation, the learned data features from the encoder are boosted so as to enhance the self-expressiveness learning and result in better clustering performance. Experiments test FDSC on public datasets and compare with other clustering methods, demonstrating the effectiveness of FDSC.


Attributed Graph Clustering in Collaborative Settings

arXiv.org Artificial Intelligence

Graph clustering is an unsupervised machine learning method that partitions the nodes in a graph into different groups. Despite achieving significant progress in exploiting both attributed and structured data information, graph clustering methods often face practical challenges related to data isolation. Moreover, the absence of collaborative methods for graph clustering limits their effectiveness. In this paper, we propose a collaborative graph clustering framework for attributed graphs, supporting attributed graph clustering over vertically partitioned data with different participants holding distinct features of the same data. Our method leverages a novel technique that reduces the sample space, improving the efficiency of the attributed graph clustering method. Furthermore, we compare our method to its centralized counterpart under a proximity condition, demonstrating that the successful local results of each participant contribute to the overall success of the collaboration. We fully implement our approach and evaluate its utility and efficiency by conducting experiments on four public datasets. The results demonstrate that our method achieves comparable accuracy levels to centralized attributed graph clustering methods. Our collaborative graph clustering framework provides an efficient and effective solution for graph clustering challenges related to data isolation.


Dive into Time-Series Anomaly Detection: A Decade Review

arXiv.org Machine Learning

Recent advances in data collection technology, accompanied by the ever-rising volume and velocity of streaming data, underscore the vital need for time series analytics. In this regard, time-series anomaly detection has been an important activity, entailing various applications in fields such as cyber security, financial markets, law enforcement, and health care. While traditional literature on anomaly detection is centered on statistical measures, the increasing number of machine learning algorithms in recent years call for a structured, general characterization of the research methods for time-series anomaly detection. This survey groups and summarizes anomaly detection existing solutions under a process-centric taxonomy in the time series context. In addition to giving an original categorization of anomaly detection methods, we also perform a meta-analysis of the literature and outline general trends in time-series anomaly detection research.


Bridging the Gap: A Decade Review of Time-Series Clustering Methods

arXiv.org Artificial Intelligence

Time series, as one of the most fundamental representations of sequential data, has been extensively studied across diverse disciplines, including computer science, biology, geology, astronomy, and environmental sciences. The advent of advanced sensing, storage, and networking technologies has resulted in high-dimensional time-series data, however, posing significant challenges for analyzing latent structures over extended temporal scales. Time-series clustering, an established unsupervised learning strategy that groups similar time series together, helps unveil hidden patterns in these complex datasets. In this survey, we trace the evolution of time-series clustering methods from classical approaches to recent advances in neural networks. While previous surveys have focused on specific methodological categories, we bridge the gap between traditional clustering methods and emerging deep learning-based algorithms, presenting a comprehensive, unified taxonomy for this research area. This survey highlights key developments and provides insights to guide future research in time-series clustering.


A Survey on Time-Series Distance Measures

arXiv.org Artificial Intelligence

Distance measures have been recognized as one of the fundamental building blocks in time-series analysis tasks, e.g., querying, indexing, classification, clustering, anomaly detection, and similarity search. The vast proliferation of time-series data across a wide range of fields has increased the relevance of evaluating the effectiveness and efficiency of these distance measures. To provide a comprehensive view of this field, this work considers over 100 state-of-the-art distance measures, classified into 7 categories: lock-step measures, sliding measures, elastic measures, kernel measures, feature-based measures, model-based measures, and embedding measures. Beyond providing comprehensive mathematical frameworks, this work also delves into the distinctions and applications across these categories for both univariate and multivariate cases. By providing comprehensive collections and insights, this study paves the way for the future development of innovative time-series distance measures.


Asynchronous Federated Clustering with Unknown Number of Clusters

arXiv.org Artificial Intelligence

Federated Clustering (FC) is crucial to mining knowledge from unlabeled non-Independent Identically Distributed (non-IID) data provided by multiple clients while preserving their privacy. Most existing attempts learn cluster distributions at local clients, and then securely pass the desensitized information to the server for aggregation. However, some tricky but common FC problems are still relatively unexplored, including the heterogeneity in terms of clients' communication capacity and the unknown number of proper clusters $k^*$. To further bridge the gap between FC and real application scenarios, this paper first shows that the clients' communication asynchrony and unknown $k^*$ are complex coupling problems, and then proposes an Asynchronous Federated Cluster Learning (AFCL) method accordingly. It spreads the excessive number of seed points to the clients as a learning medium and coordinates them across the clients to form a consensus. To alleviate the distribution imbalance cumulated due to the unforeseen asynchronous uploading from the heterogeneous clients, we also design a balancing mechanism for seeds updating. As a result, the seeds gradually adapt to each other to reveal a proper number of clusters. Extensive experiments demonstrate the efficacy of AFCL.


Self-Calibrated Dual Contrasting for Annotation-Efficient Bacteria Raman Spectroscopy Clustering and Classification

arXiv.org Artificial Intelligence

Raman scattering is based on molecular vibration spectroscopy and provides a powerful technology for pathogenic bacteria diagnosis using the unique molecular fingerprint information of a substance. The integration of deep learning technology has significantly improved the efficiency and accuracy of intelligent Raman spectroscopy (RS) recognition. However, the current RS recognition methods based on deep neural networks still require the annotation of a large amount of spectral data, which is labor-intensive. This paper presents a novel annotation-efficient Self-Calibrated Dual Contrasting (SCDC) method for RS recognition that operates effectively with few or no annotation. Our core motivation is to represent the spectrum from two different perspectives in two distinct subspaces: embedding and category. The embedding perspective captures instance-level information, while the category perspective reflects category-level information. Accordingly, we have implemented a dual contrastive learning approach from two perspectives to obtain discriminative representations, which are applicable for Raman spectroscopy recognition under both unsupervised and semi-supervised learning conditions. Furthermore, a self-calibration mechanism is proposed to enhance robustness. Validation of the identification task on three large-scale bacterial Raman spectroscopy datasets demonstrates that our SCDC method achieves robust recognition performance with very few (5$\%$ or 10$\%$) or no annotations, highlighting the potential of the proposed method for biospectral identification in annotation-efficient clinical scenarios.


A Greedy Strategy for Graph Cut

arXiv.org Artificial Intelligence

We propose a Greedy strategy to solve the problem of Graph Cut, called GGC. It starts from the state where each data sample is regarded as a cluster and dynamically merges the two clusters which reduces the value of the global objective function the most until the required number of clusters is obtained, and the monotonicity of the sequence of objective function values is proved. To reduce the computational complexity of GGC, only mergers between clusters and their neighbors are considered. Therefore, GGC has a nearly linear computational complexity with respect to the number of samples. Also, unlike other algorithms, due to the greedy strategy, the solution of the proposed algorithm is unique. In other words, its performance is not affected by randomness. We apply the proposed method to solve the problem of normalized cut which is a widely concerned graph cut problem. Extensive experiments show that better solutions can often be achieved compared to the traditional two-stage optimization algorithm (eigendecomposition + k-means), on the normalized cut problem. In addition, the performance of GGC also has advantages compared to several state-of-the-art clustering algorithms.