 Clustering


Reference-Based Sequence Classification

arXiv.org Machine Learning

Sequence classification is an important data mining task in many real world applications. Over the past few decades, many sequence classification methods have been proposed from different aspects. In particular, the pattern-based method is one of the most important and widely studied sequence classification methods in the literature. In this paper, we present a reference-based sequence classification framework, which can unify existing pattern-based sequence classification methods under the same umbrella. More importantly, this framework can be used as a general platform for developing new sequence classification algorithms. By utilizing this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions. Experimental results show that new methods developed under the proposed framework are capable of achieving comparable classification accuracy to those state-of-the-art sequence classification algorithms.


Sampling Clustering

arXiv.org Artificial Intelligence

We propose an efficient linear-time graph-based divisive cluster analysis approach called Sampling Clustering. It constructs a lite informative dendrogram by recursively dividing a graph into subgraphs. In each recursive call, a graph is first sampled, with a set of vertices being removed to disconnect latent clusters, and then condensed by adding edges to the remaining vertices to avoid graph fragmentation caused by vertex removals. We also present some sampling and condensing methods and discuss their effectiveness in this paper. Our implementations run in linear time and achieve outstanding performance on various types of datasets. Experimental results show that they outperform state-of-the-art clustering algorithms with significantly less computing resource requirements.


Adaptation of Multivariate Concept to Multi-Way Agglomerative Clustering for Hierarchical Aspect Aggregation

AAAI Conferences

Hierarchical review aspect aggregation is an important challenge in review summarization. Currently, agglomerative clustering is widely used for hierarchical aspect aggregation. We identify an important but less studied issue in using agglomerative clustering for the aforementioned task. This paper proposes a novel approach to generate a multi-way hierarchy by adaptation of the multivariate concept. Furthermore, we propose a novel experimentation approach to evaluate the acceptability of the aspect relations obtained from the hierarchy generated.


Gene Selection and Clustering of Breast Cancer Data

AAAI Conferences

In this work, we first attempt to replicate an earlier study on gene selection and clustering, and then we extend this work by applying a different type of hierarchical clustering to discover interesting subsets of genes from breast cancer data. Replication of such studies is a known challenge and an active area of research in bioinformatics. The work presented in this paper is three-fold. First, we replicate a study conducted at the University of North Carolina to generate an initial set of genes. Second, we apply an approach called Distance Weighted Discrimination to fuse multiple, disparate breast cancer datasets into a single validation set. Third, we perform hierarchical clustering and k-means clustering on this validation set to discover natural groupings and compare the clusters generated by both methods. While applying the hierarchical clustering is part of the reproduction step, we extend the research by trying two different forms of hierarchical clustering. We also apply k-means clustering for the same purpose and compare all three methods using Kaplan-Meier estimation and Cox proportional hazards regression. We discover that among the three methods, k-means clustering gives us the best results.


Spectral Clustering of Signed Graphs via Matrix Power Means

arXiv.org Machine Learning

Signed graphs encode positive (attractive) and negative (repulsive) relations between nodes. We extend spectral clustering to signed graphs via the one-parameter family of Signed Power Mean Laplacians, defined as the matrix power mean of normalized standard and signless Laplacians of positive and negative edges. We provide a thorough analysis of the proposed approach in the setting of a general Stochastic Block Model that includes models such as the Labeled Stochastic Block Model and the Censored Block Model. We show that in expectation the signed power mean Laplacian captures the ground truth clusters under reasonable settings where state-of-the-art approaches fail. Moreover, we prove that the eigenvalues and eigenvectors of the signed power mean Laplacian concentrate around their expectation under reasonable conditions in the general Stochastic Block Model. Extensive experiments on random graphs and real world datasets confirm the theoretically predicted behaviour of the signed power mean Laplacian and show that it compares favourably with state-of-the-art methods.
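For orientation, the construction named in the abstract can be written out as follows. This is a sketch based on the standard definition of the matrix power mean of two matrices; the symbols W^+, W^-, D^+, D^- (weight and degree matrices of the positive and negative edges) are notation assumed here, not taken from the text above.

```latex
% Matrix power mean of two symmetric positive definite matrices A and B
M_p(A, B) = \left( \frac{A^{p} + B^{p}}{2} \right)^{1/p}

% Assumed building blocks: normalized Laplacian of the positive edges
% and normalized signless Laplacian of the negative edges
L_{\mathrm{sym}}^{+} = I - (D^{+})^{-1/2} \, W^{+} \, (D^{+})^{-1/2}, \qquad
Q_{\mathrm{sym}}^{-} = I + (D^{-})^{-1/2} \, W^{-} \, (D^{-})^{-1/2}

% Signed power mean Laplacian, a one-parameter family indexed by p
L_{p} = M_p\!\left( L_{\mathrm{sym}}^{+},\; Q_{\mathrm{sym}}^{-} \right)
```

With p = 1 this reduces to the arithmetic mean of the two Laplacians; other values of p weight the spectra of the two matrices differently.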


EasiCS: the objective and fine-grained classification method of cervical spondylosis dysfunction

arXiv.org Machine Learning

Cervical spondylosis (CS), a common degenerative disease, harms human life and health, affects up to two-thirds of the population, and poses a serious burden on individuals and society (Matz et al. 2009; Kotil and Bilge 2008; Cai et al. 2016; Nana Wang; Wang et al. 2018). Currently, the neck disability index (NDI) (Howard Vernon) is the most commonly used tool to assess neck dysfunction (Vernon and Mior 1991), but its usefulness is mainly undermined by its coarse-grained and unreasonable classification, and the NDI information is subjective and not accurate enough. To this end, we proposed and developed the classification framework EasiCS to obtain relatively stable clustering results; it consists of dimension reduction, the clustering algorithm EasiSOM, and the spectral clustering algorithm EasiSC, as shown in Figure 1. To the best of our knowledge, EasiCS is the first effort to utilize the clustering algorithm and sEMG. Compared with the seven commonly used clustering algorithms, the novel framework EasiCS provides the best overall performance.


A self-organising eigenspace map for time series clustering

arXiv.org Machine Learning

This paper presents a novel time series clustering method, the self-organising eigenspace map (SOEM), based on a generalisation of the well-known self-organising feature map (SOFM). The SOEM operates on the eigenspaces of the embedded covariance structures of time series, which are related directly to modes in those time series. Approximate joint diagonalisation acts as a pseudo-metric across these spaces, allowing us to generalise the SOFM to a neural network with matrix input. The technique is empirically validated against three sets of experiments: univariate and multivariate time series clustering, and application to (clustered) multivariate time series forecasting. Results indicate that the technique performs a valid topologically ordered clustering of the time series. The clustering is superior to standard benchmarks when the data is non-aligned, gives the best clustering stage when used in forecasting, can be used with partial/non-overlapping time series and multivariate clustering, and produces a topological representation of the time series objects.


Multi-View Multiple Clustering

arXiv.org Machine Learning

Multiple clustering aims at exploring alternative clusterings to organize the data into meaningful groups from different perspectives. Existing multiple clustering algorithms are designed for single-view data. We assume that the individuality and commonality of multi-view data can be leveraged to generate high-quality and diverse clusterings. To this end, we propose a novel multi-view multiple clustering (MVMC) algorithm. MVMC first adapts multi-view self-representation learning to explore the individuality encoding matrices and the shared commonality matrix of multi-view data. It additionally reduces the redundancy (i.e., enhancing the individuality) among the matrices using the Hilbert-Schmidt Independence Criterion (HSIC), and collects shared information by forcing the shared matrix to be smooth across all views. It then uses matrix factorization on the individual matrices, along with the shared matrix, to generate diverse clusterings of high-quality. We further extend multiple co-clustering on multi-view data and propose a solution called multi-view multiple co-clustering (MVMCC). Our empirical study shows that MVMC (MVMCC) can exploit multi-view data to generate multiple high-quality and diverse clusterings (co-clusterings), with superior performance to the state-of-the-art methods.
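As a reminder of what the Hilbert-Schmidt Independence Criterion measures, the commonly used biased empirical estimator is sketched below. The kernel matrices K and L and the sample size n are notation assumed here; how MVMC builds these matrices from its individuality matrices is not stated in the abstract.

```latex
% Empirical HSIC for n samples, with kernel (Gram) matrices K and L
% computed on the two representations being compared
\mathrm{HSIC}(K, L) = \frac{1}{(n-1)^{2}} \, \operatorname{tr}\!\left( K H L H \right),
\qquad
H = I_{n} - \frac{1}{n} \mathbf{1}\mathbf{1}^{\top}
```

A value near zero indicates that the two representations are close to statistically independent, which is why penalizing HSIC between matrices pushes them toward encoding different information.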


Determining Number of Clusters in One Picture

#artificialintelligence

If you want to determine the optimal number of clusters in your analysis, you're faced with an overwhelming number of (mostly subjective) choices. Note that there's no "best" method, no "correct" k, and there isn't even a consensus as to the definition of what a "cluster" is. With that said, this picture focuses on three popular methods that should fit almost every need: Silhouette, Elbow, and Gap Statistic.
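Of the three methods named, the silhouette is the easiest to show in a few lines. The following is a minimal pure-Python sketch of the mean silhouette coefficient, assuming Euclidean distance; the function name `silhouette_score`, the toy points, and both labelings are illustrative assumptions, not material from the post.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (illustrative sketch)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s = 0
        # a: mean distance from p to the other members of its own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of another cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy data: two tight, well-separated groups of three points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two_clusters = [0, 0, 0, 1, 1, 1]    # matches the visible grouping
three_clusters = [0, 0, 1, 2, 2, 2]  # splits one group unnecessarily

score_k2 = silhouette_score(points, two_clusters)
score_k3 = silhouette_score(points, three_clusters)
print(score_k2, score_k3)
```

In practice one would run a clustering algorithm for several values of k and pick the k with the highest mean silhouette; here the labelings are fixed by hand only to keep the sketch short.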


Explainable AI for Trees: From Local Explanations to Global Understanding

arXiv.org Machine Learning

Tree-based machine learning models such as random forests, decision trees, and gradient boosted trees are the most popular non-linear predictive models used in practice today, yet comparatively little attention has been paid to explaining their predictions. Here we significantly improve the interpretability of tree-based models through three main contributions: 1) The first polynomial time algorithm to compute optimal explanations based on game theory. 2) A new type of explanation that directly measures local feature interaction effects. 3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to i) identify high magnitude but low frequency non-linear mortality risk factors in the general US population, ii) highlight distinct population sub-groups with shared risk characteristics, iii) identify non-linear interaction effects among risk factors for chronic kidney disease, and iv) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model's performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains.
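The game-theoretic explanations mentioned in the first contribution are assumed here to be Shapley values. The sketch below computes them exactly by brute-force enumeration of all feature coalitions, which takes exponential time; it is not the paper's polynomial-time tree algorithm, only an illustration of the quantity such an algorithm would compute. The names `shapley_values` and `toy_value` and the toy payoff function are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_players):
    """Exact Shapley values by enumerating every coalition (exponential time)."""
    players = range(n_players)
    phi = [0.0] * n_players
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = (
                factorial(size) * factorial(n_players - size - 1)
                / factorial(n_players)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of player i when joining coalition s
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_value(coalition):
    """Hypothetical payoff: features 0 and 1 only matter together, 2 acts alone."""
    out = 0.0
    if 0 in coalition and 1 in coalition:
        out += 4.0
    if 2 in coalition:
        out += 1.0
    return out


phi = shapley_values(toy_value, 3)
print(phi)
```

In this toy game the interaction payoff of 4 is split evenly between features 0 and 1, feature 2 keeps its own payoff of 1, and the attributions sum to the payoff of the full coalition.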