Goto

Collaborating Authors

 Clustering


Interpretable Multi-View Clustering Based on Anchor Graph Tensor Factorization

arXiv.org Artificial Intelligence

The clustering method based on the anchor graph has gained significant attention due to its exceptional clustering performance and ability to process large-scale data. One common approach is to learn bipartite graphs with K-connected components, helping avoid the need for post-processing. However, this method has strict parameter requirements and may not always get K-connected components. To address this issue, an alternative approach is to directly obtain the cluster label matrix by performing non-negative matrix factorization (NMF) on the anchor graph. Nevertheless, existing multi-view clustering methods based on anchor graph factorization lack adequate cluster interpretability for the decomposed matrix and often overlook the inter-view information. We address this limitation by using non-negative tensor factorization to decompose an anchor graph tensor that combines anchor graphs from multiple views. This approach allows us to consider inter-view information comprehensively. The decomposed tensors, namely the sample indicator tensor and the anchor indicator tensor, enhance the interpretability of the factorization. Extensive experiments validate the effectiveness of this method.


Visualizing Routes with AI-Discovered Street-View Patterns

arXiv.org Artificial Intelligence

Street-level visual appearances play an important role in studying social systems, such as understanding the built environment, driving routes, and associated social and economic factors. It has not been integrated into a typical geographical visualization interface (e.g., map services) for planning driving routes. In this paper, we study this new visualization task with several new contributions. First, we experiment with a set of AI techniques and propose a solution of using semantic latent vectors for quantifying visual appearance features. Second, we calculate image similarities among a large set of street-view images and then discover spatial imagery patterns. Third, we integrate these discovered patterns into driving route planners with new visualization techniques. Finally, we present VivaRoutes, an interactive visualization prototype, to show how visualizations leveraged with these discovered patterns can help users effectively and interactively explore multiple routes. Furthermore, we conducted a user study to assess the usefulness and utility of VivaRoutes.


Clustering for Protein Representation Learning

arXiv.org Artificial Intelligence

Protein representation learning is a challenging task that aims to capture the structure and function of proteins from their amino acid sequences. Previous methods largely ignored the fact that not all amino acids are equally important for protein folding and activity. In this article, we propose a neural clustering framework that can automatically discover the critical components of a protein by considering both its primary and tertiary structure information. Our framework treats a protein as a graph, where each node represents an amino acid and each edge represents a spatial or sequential connection between amino acids. We then apply an iterative clustering strategy to group the nodes into clusters based on their 1D and 3D positions and assign scores to each cluster. We select the highest-scoring clusters and use their medoid nodes for the next iteration of clustering, until we obtain a hierarchical and informative representation of the protein. We evaluate on four protein-related tasks: protein fold classification, enzyme reaction classification, gene ontology term prediction, and enzyme commission number prediction. Experimental results demonstrate that our method achieves state-of-the-art performance.


Leveraging Pre-trained and Transformer-derived Embeddings from EHRs to Characterize Heterogeneity Across Alzheimer's Disease and Related Dementias

arXiv.org Artificial Intelligence

Alzheimer's disease is a progressive, debilitating neurodegenerative disease that affects 50 million people globally. Despite this substantial health burden, available treatments for the disease are limited and its fundamental causes remain poorly understood. Previous work has suggested the existence of clinically-meaningful sub-types, which it is suggested may correspond to distinct etiologies, disease courses, and ultimately appropriate treatments. Here, we use unsupervised learning techniques on electronic health records (EHRs) from a cohort of memory disorder patients to characterise heterogeneity in this disease population. Pre-trained embeddings for medical codes as well as transformer-derived Clinical BERT embeddings of free text are used to encode patient EHRs. We identify the existence of sub-populations on the basis of comorbidities and shared textual features, and discuss their clinical significance.


From Similarity to Superiority: Channel Clustering for Time Series Forecasting

arXiv.org Artificial Intelligence

Time series forecasting has attracted significant attention in recent decades. Previous studies have demonstrated that the Channel-Independent (CI) strategy improves forecasting performance by treating different channels individually, while it leads to poor generalization on unseen instances and ignores potentially necessary interactions between channels. Conversely, the Channel-Dependent (CD) strategy mixes all channels with even irrelevant and indiscriminate information, which, however, results in oversmoothing issues and limits forecasting accuracy. There is a lack of channel strategy that effectively balances individual channel treatment for improved forecasting performance without overlooking essential interactions between channels. Motivated by our observation of a correlation between the time series model's performance boost against channel mixing and the intrinsic similarity on a pair of channels, we developed a novel and adaptable Channel Clustering Module (CCM). CCM dynamically groups channels characterized by intrinsic similarities and leverages cluster identity instead of channel identity, combining the best of CD and CI worlds. Extensive experiments on real-world datasets demonstrate that CCM can (1) boost the performance of CI and CD models by an average margin of 2.4% and 7.2% on long-term and short-term forecasting, respectively; (2) enable zero-shot forecasting with mainstream time series forecasting models; (3) uncover intrinsic time series patterns among channels and improve interpretability of complex time series models.


Learning using granularity statistical invariants for classification

arXiv.org Artificial Intelligence

Learning using statistical invariants (LUSI) is a new learning paradigm, which adopts weak convergence mechanism, and can be applied to a wider range of classification problems. However, the computation cost of invariant matrices in LUSI is high for large-scale datasets during training. To settle this issue, this paper introduces a granularity statistical invariant for LUSI, and develops a new learning paradigm called learning using granularity statistical invariants (LUGSI). LUGSI employs both strong and weak convergence mechanisms, taking a perspective of minimizing expected risk. As far as we know, it is the first time to construct granularity statistical invariants. Compared to LUSI, the introduction of this new statistical invariant brings two advantages. Firstly, it enhances the structural information of the data. Secondly, LUGSI transforms a large invariant matrix into a smaller one by maximizing the distance between classes, achieving feasibility for large-scale datasets classification problems and significantly enhancing the training speed of model operations. Experimental results indicate that LUGSI not only exhibits improved generalization capabilities but also demonstrates faster training speed, particularly for large-scale datasets.


Classification and Clustering of Sentence-Level Embeddings of Scientific Articles Generated by Contrastive Learning

arXiv.org Artificial Intelligence

Scientific articles are long text documents organized into sections, each describing aspects of the research. Analyzing scientific production has become progressively challenging due to the increase in the number of available articles. Within this scenario, our approach consisted of fine-tuning transformer language models to generate sentence-level embeddings from scientific articles, considering the following labels: background, objective, methods, results, and conclusion. We trained our models on three datasets with contrastive learning. Two datasets are from the article's abstracts in the computer science and medical domains. Also, we introduce PMC-Sents-FULL, a novel dataset of sentences extracted from the full texts of medical articles. We compare the fine-tuned and baseline models in clustering and classification tasks to evaluate our approach. On average, clustering agreement measures values were five times higher. For the classification measures, in the best-case scenario, we had an average improvement in F1-micro of 30.73\%. Results show that fine-tuning sentence transformers with contrastive learning and using the generated embeddings in downstream tasks is a feasible approach to sentence classification in scientific articles. Our experiment codes are available on GitHub.


A Two-Phase Recall-and-Select Framework for Fast Model Selection

arXiv.org Artificial Intelligence

As the ubiquity of deep learning in various machine learning applications has amplified, a proliferation of neural network models has been trained and shared on public model repositories. In the context of a targeted machine learning assignment, utilizing an apt source model as a starting point typically outperforms the strategy of training from scratch, particularly with limited training data. Despite the investigation and development of numerous model selection strategies in prior work, the process remains time-consuming, especially given the ever-increasing scale of model repositories. In this paper, we propose a two-phase (coarse-recall and fine-selection) model selection framework, aiming to enhance the efficiency of selecting a robust model by leveraging the models' training performances on benchmark datasets. Specifically, the coarse-recall phase clusters models showcasing similar training performances on benchmark datasets in an offline manner. A light-weight proxy score is subsequently computed between this model cluster and the target dataset, which serves to recall a significantly smaller subset of potential candidate models in a swift manner. In the following fine-selection phase, the final model is chosen by fine-tuning the recalled models on the target dataset with successive halving. To accelerate the process, the final fine-tuning performance of each potential model is predicted by mining the model's convergence trend on the benchmark datasets, which aids in filtering lower performance models more earlier during fine-tuning. Through extensive experimentation on tasks covering natural language processing and computer vision, it has been demonstrated that the proposed methodology facilitates the selection of a high-performing model at a rate about 3x times faster than conventional baseline methods. Our code is available at https://github.com/plasware/two-phase-selection.


Graph-Based Optimisation of Network Expansion in a Dockless Bike Sharing System

arXiv.org Artificial Intelligence

Bike-sharing systems (BSSs) are deployed in over a thousand cities worldwide and play an important role in many urban transportation systems. BSSs alleviate congestion, reduce pollution and promote physical exercise. It is essential to explore the spatiotemporal patterns of bike-sharing demand, as well as the factors that influence these patterns, in order to optimise system operational efficiency. In this study, an optimised geo-temporal graph is constructed using trip data from Moby Bikes, a dockless BSS operator. The process of optimising the graph unveiled prime locations for erecting new stations during future expansions of the BSS. The Louvain algorithm, a community detection technique, is employed to uncover usage patterns at different levels of temporal granularity. The community detection results reveal largely self-contained sub-networks that exhibit similar usage patterns at their respective levels of temporal granularity. Overall, this study reinforces that BSSs are intrinsically spatiotemporal systems, with community presence driven by spatiotemporal dynamics. These findings may aid operators in improving redistribution efficiency.


CDIMC-net: Cognitive Deep Incomplete Multi-view Clustering Network

arXiv.org Artificial Intelligence

In recent years, incomplete multi-view clustering, which studies the challenging multi-view clustering problem on missing views, has received growing research interests. Although a series of methods have been proposed to address this issue, the following problems still exist: 1) Almost all of the existing methods are based on shallow models, which is difficult to obtain discriminative common representations. 2) These methods are generally sensitive to noise or outliers since the negative samples are treated equally as the important samples. In this paper, we propose a novel incomplete multi-view clustering network, called Cognitive Deep Incomplete Multi-view Clustering Network (CDIMC-net), to address these issues. Specifically, it captures the high-level features and local structure of each view by incorporating the view-specific deep encoders and graph embedding strategy into a framework. Moreover, based on the human cognition, i.e., learning from easy to hard, it introduces a self-paced strategy to select the most confident samples for model training, which can reduce the negative influence of outliers. Experimental results on several incomplete datasets show that CDIMC-net outperforms the state-of-the-art incomplete multi-view clustering methods.