 Clustering


Unsupervised Network Embedding for Graph Visualization, Clustering and Classification

arXiv.org Machine Learning

A main challenge in mining network-based data is finding effective ways to represent or encode graph structures so that they can be efficiently exploited by machine learning algorithms. Several methods have focused on network representation at the node/edge or substructure level. However, many real-life challenges, such as time-varying, multilayer, chemical compound and brain networks, involve analysis of a family of graphs instead of a single one, opening additional challenges in graph comparison and representation. Traditional approaches for learning representations rely on hand-crafting specialized heuristics to extract meaningful information about the graphs, e.g., statistical properties, structural features, etc., as well as engineered graph distances to quantify dissimilarity between networks. In this work we provide an unsupervised approach to learn embedding representations for a collection of graphs so that they can be used in numerous graph mining tasks. By using an unsupervised neural network approach on input graphs, we aim to capture the underlying distribution of the data in order to discriminate between different classes of networks. Our method is assessed empirically on synthetic and real-life datasets and evaluated in three different tasks: graph clustering, visualization and classification. Results reveal that our method outperforms well-known graph distances and graph kernels in clustering and classification tasks, while being highly efficient in runtime.


From Low-Level Events to Activities -- A Session-Based Approach (Extended Version)

arXiv.org Artificial Intelligence

Process mining techniques aim to use event data about past executions to gain insight into how processes are executed. While these techniques have proven to be very valuable, they are less successful at reaching their goal if the process is flexible and, hence, events can potentially occur in any order. Furthermore, information systems can record events at a very low level, which does not match the high-level concepts known at the business level. Without abstracting sequences of events to high-level concepts, the results of applying process mining (e.g., discovered models) easily become very complex and difficult to interpret, which ultimately means that they are of little use. A large body of research exists on event abstraction, but typically a large amount of domain knowledge is required as input, which is often not readily available. Other abstraction techniques are unsupervised but give lower accuracy. This paper puts forward a technique that requires limited domain knowledge that can be easily provided. Traces are divided into sessions, and each session is abstracted as one single high-level activity execution. The abstraction is based on a combination of automatic clustering and visualization methods. The technique was assessed on two case studies that evidently exhibit a large amount of behavior. The results clearly illustrate the benefits of the abstraction to convey knowledge to stakeholders.
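As a rough illustration of dividing a trace into sessions, the sketch below starts a new session whenever the time between two consecutive events exceeds a threshold. The abstract does not state the splitting criterion, so the gap rule, the function name split_into_sessions, the max_gap parameter and the example trace are all assumptions made for illustration only.

```python
def split_into_sessions(events, max_gap):
    """Split a time-ordered trace into sessions.

    events  : list of (timestamp, activity_name) pairs, sorted by timestamp
    max_gap : hypothetical threshold; a longer pause starts a new session
    """
    sessions = []
    current = []
    previous_time = None
    for timestamp, name in events:
        # A pause longer than max_gap closes the current session.
        if previous_time is not None and timestamp - previous_time > max_gap:
            sessions.append(current)
            current = []
        current.append((timestamp, name))
        previous_time = timestamp
    if current:
        sessions.append(current)
    return sessions


# Hypothetical low-level trace (timestamps in seconds).
trace = [(0, "open_page"), (5, "click"), (8, "scroll"),
         (100, "open_page"), (104, "submit")]
sessions = split_into_sessions(trace, max_gap=30)
```

Each resulting session could then be summarized (for example, by activity counts) and clustered, with one cluster label standing in for one high-level activity.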


Learning Latent Representations of Bank Customers With The Variational Autoencoder

arXiv.org Machine Learning

Learning data representations that reflect the customers' creditworthiness can improve marketing campaigns, customer relationship management, data and process management or the credit risk assessment in retail banks. In this research, we adopt the Variational Autoencoder (VAE), which has the ability to learn latent representations that contain useful information. We show that it is possible to steer the latent representations in the latent space of the VAE by using the Weight of Evidence to form a specific grouping of the data that reflects the customers' creditworthiness. Our proposed method learns a latent representation of the data, which shows a well-defined clustering structure capturing the customers' creditworthiness. These clusters are well suited for the aforementioned banks' activities.


Low-rank Kernel Learning for Graph-based Clustering

arXiv.org Machine Learning

Constructing the adjacency graph is fundamental to graph-based clustering. Graph learning in kernel space has shown impressive performance on a number of benchmark data sets. However, its performance is largely determined by the chosen kernel matrix. To address this issue, the previous multiple kernel learning algorithm has been applied to learn an optimal kernel from a group of predefined kernels. This approach might be sensitive to noise and limits the representation ability of the consensus kernel. In contrast to existing methods, we propose to learn a low-rank kernel matrix which exploits the similarity nature of the kernel matrix and seeks an optimal kernel from the neighborhood of candidate kernels. By formulating graph construction and kernel learning in a unified framework, the graph and consensus kernel can be iteratively enhanced by each other. Extensive experimental results validate the efficacy of the proposed method.


Non-Negative Local Sparse Coding for Subspace Clustering

arXiv.org Machine Learning

Subspace sparse coding (SSC) algorithms have proven to be beneficial to clustering problems. They provide an alternative data representation in which the underlying structure of the clusters can be better captured. However, most of the research in this area is focused on enhancing the sparse coding part of the problem. In contrast, we introduce a novel objective term in our proposed SSC framework which focuses on the separability of data points in the coding space. We also provide mathematical insights into how this local-separability term improves the clustering result of the SSC framework. Our proposed non-negative local SSC algorithm (NLSSC) also benefits from the efficient choice of its sparsity terms and constraints. The NLSSC algorithm is also formulated in the kernel-based framework (NLKSSC), which can represent the nonlinear structure of data. In addition, we address the possibility of having redundancies in sparse coding results and their negative effect on graph-based clustering problems. We introduce the link-restore post-processing step to improve the representation graph of non-negative SSC algorithms such as ours. Empirical evaluations on well-known clustering benchmarks show that our proposed NLSSC framework results in better clusterings compared to the state-of-the-art baselines and demonstrate the effectiveness of the link-restore post-processing in improving the clustering accuracy by correcting the broken links of the representation graph.


Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions

arXiv.org Machine Learning

Robust clustering of high-dimensional data is an important topic because, in many practical situations, real data sets are heavy-tailed and/or asymmetric. Moreover, traditional model-based clustering often fails for high-dimensional data due to the number of free covariance parameters. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed by including a penalty term in the likelihood that constrains the parameters, resulting in a flexible model for high-dimensional data and a meaningful interpretation. An analytically feasible EM algorithm is developed by placing a gamma-Lasso penalty constraining the concentration matrix. The proposed methodology is investigated through simulation studies and two real data sets.


Clustering methods for unsupervised machine learning

#artificialintelligence

Now we have the probability that each data point belongs to each cluster. If we need hard cluster assignments, we can just choose for each data point to belong to the cluster with the highest probability. But the nice thing about EM is that we can embrace the fuzziness of the cluster membership. We can look at a data point and consider the fact that while it most likely belongs to Cluster B, it's also quite likely to belong to Cluster D. This also takes into account the fact that there may not be clear cut boundaries between our clusters. These groups consist of overlapping multi-dimensional distributions, so drawing clear cut lines might not always be the best solution.
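To make the soft versus hard assignment concrete, here is a minimal sketch of the E-step for a one-dimensional Gaussian mixture. The mixture parameters (two clusters with equal weights, means 0 and 4, unit standard deviations) and the function names are hypothetical values chosen for illustration; they do not come from the article.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, stds):
    """E-step for one point: posterior probability of each cluster."""
    joint = [w * gaussian_pdf(x, m, s)
             for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


def hard_assignment(resp):
    """Index of the cluster with the highest membership probability."""
    return max(range(len(resp)), key=lambda k: resp[k])


# Hypothetical two-cluster mixture.
weights = [0.5, 0.5]
means = [0.0, 4.0]
stds = [1.0, 1.0]

soft = responsibilities(1.5, weights, means, stds)  # fuzzy membership
label = hard_assignment(soft)                       # single best cluster
```

The point at 1.5 is assigned to the first cluster by the hard rule, yet the soft memberships still record a non-negligible probability for the second cluster, which is the fuzziness the paragraph above describes.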


Similarity Learning via Kernel Preserving Embedding

arXiv.org Artificial Intelligence

Data similarity is a key concept in many data-driven applications. Many algorithms are sensitive to similarity measures. To tackle this fundamental problem, automatic learning of similarity information from data via self-expression has been developed and successfully applied in various models, such as low-rank representation, sparse subspace learning, and semi-supervised learning. However, it only tries to reconstruct the original data, and some valuable information, e.g., the manifold structure, is largely ignored. In this paper, we argue that it is beneficial to preserve the overall relations when we extract similarity information. Specifically, we propose a novel similarity learning framework by minimizing the reconstruction error of kernel matrices, rather than the reconstruction error of original data adopted by existing work. Taking the clustering task as an example to evaluate our method, we observe considerable improvements compared to other state-of-the-art methods. More importantly, our proposed framework is very general and provides a novel and fundamental building block for many other similarity-based tasks. In addition, our proposed kernel-preserving approach opens up a large number of possibilities to embed high-dimensional data into low-dimensional space.


Revisiting clustering as matrix factorisation on the Stiefel manifold

arXiv.org Machine Learning

Our approach leverages the well known Burer-Monteiro factorisation strategy from large scale optimisation, in the context of low rank estimation. Moreover, our Burer-Monteiro factors are shown to lie on a Stiefel manifold. We propose a new generalized Bayesian estimator for this problem and prove novel prediction bounds for clustering. We also devise a componentwise Langevin sampler on the Stiefel manifold to compute this estimator.


Diffusion $K$-means clustering on manifolds: provable exact recovery via semidefinite relaxations

arXiv.org Machine Learning

We introduce the diffusion $K$-means clustering method on Riemannian submanifolds, which maximizes the within-cluster connectedness based on the diffusion distance. The diffusion $K$-means constructs a random walk on the similarity graph with vertices as data points randomly sampled on the manifolds and edges as similarities given by a kernel that captures the local geometry of manifolds. Thus the diffusion $K$-means is a multi-scale clustering tool that is suitable for data with non-linear and non-Euclidean geometric features in mixed dimensions. Given the number of clusters, we propose a polynomial-time convex relaxation algorithm via semidefinite programming (SDP) to solve the diffusion $K$-means. In addition, we propose a nuclear norm (i.e., trace norm) regularized SDP that is adaptive to the number of clusters. In both cases, we show that exact recovery of the SDPs for diffusion $K$-means can be achieved under suitable between-cluster separability and within-cluster connectedness of the submanifolds, which together quantify the hardness of the manifold clustering problem. We further propose the localized diffusion $K$-means by using the local adaptive bandwidth estimated from the nearest neighbors. We show that exact recovery of the localized diffusion $K$-means is fully adaptive to the local probability density and geometric structures of the underlying submanifolds.
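The random walk and diffusion distance mentioned above can be sketched in a few lines. This is an illustrative sketch only: it uses a Gaussian kernel and the textbook diffusion distance (squared differences of t-step transition probabilities, weighted by the inverse stationary distribution), which may differ from the exact kernel and normalization used in the paper. The bandwidth, the number of steps t, the function names and the sample points are all assumptions.

```python
import math


def gaussian_kernel_matrix(points, bandwidth):
    """Pairwise similarities exp(-||x - y||^2 / (2 * bandwidth^2))."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [[math.exp(-sq_dist(p, q) / (2.0 * bandwidth ** 2))
             for q in points] for p in points]


def random_walk_matrix(W):
    """Row-normalize similarities into transition probabilities."""
    return [[w / sum(row) for w in row] for row in W]


def matmul(A, B):
    """Plain-Python product of two square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def diffusion_distance(W, t, i, j):
    """Diffusion distance between points i and j after t steps (t >= 1)."""
    degrees = [sum(row) for row in W]
    total = sum(degrees)
    stationary = [d / total for d in degrees]  # stationary distribution
    P = random_walk_matrix(W)
    Pt = P
    for _ in range(t - 1):
        Pt = matmul(Pt, P)
    return math.sqrt(sum((Pt[i][z] - Pt[j][z]) ** 2 / stationary[z]
                         for z in range(len(W))))


# Two hypothetical, well-separated groups of points in the plane.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]
W = gaussian_kernel_matrix(points, bandwidth=0.5)
within = diffusion_distance(W, t=2, i=0, j=1)    # same group
between = diffusion_distance(W, t=2, i=0, j=3)   # different groups
```

Points in the same group reach nearly the same set of neighbors after a few steps, so their diffusion distance is small, while points in different groups barely share any probability mass and end up far apart.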