Clustering
Biclustering with Alternating K-Means
Biclustering is the task of simultaneously clustering the rows and columns of the data matrix into different subgroups such that the rows and columns within a subgroup exhibit similar patterns. In this paper, we consider the case of producing exclusive row and column biclusters. We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk. We develop and prove a consistency result with respect to the empirical clustering risk. Since the optimization problem is combinatorial in nature, finding the global minimum is computationally intractable. In light of this fact, we propose a simple and novel algorithm that finds a local minimum by alternating the use of an adapted version of the k-means clustering algorithm between columns and rows. We evaluate and compare the performance of our algorithm to other related biclustering methods on both simulated data and real-world gene expression data sets. The results demonstrate that our algorithm is able to detect meaningful structures in the data and outperform other competing biclustering methods in various settings and situations.
Understanding and Exploiting Dependent Variables with Deep Metric Learning
Mahony, Niall O', Campbell, Sean, Carvalho, Anderson, Krpalkova, Lenka, Velasco-Hernandez, Gustavo, Riordan, Daniel, Walsh, Joseph
Deep Metric Learning (DML) approaches learn to represent inputs to a lower-dimensional latent space such that the distance between representations in this space corresponds with a predefined notion of similarity. This paper investigates how the mapping element of DML may be exploited in situations where the salient features in arbitrary classification problems vary over time or due to changing underlying variables. Examples of such variable features include seasonal and time-of-day variations in outdoor scenes in place recognition tasks for autonomous navigation and age/gender variations in human/animal subjects in classification tasks for medical/ethological studies. Through the use of visualisation tools for observing the distribution of DML representations per each query variable for which prior information is available, the influence of each variable on the classification task may be better understood. Based on these relationships, prior information on these salient background variables may be exploited at the inference stage of the DML approach by using a clustering algorithm to improve classification performance. This research proposes such a methodology establishing the saliency of query background variables and formulating clustering algorithms for better separating latent-space representations at run-time. The paper also discusses online management strategies to preserve the quality and diversity of data and the representation of each class in the gallery of embeddings in the DML approach. We also discuss latent works towards understanding the relevance of underlying/multiple variables with DML.
CONDA-PM -- A Systematic Review and Framework for Concept Drift Analysis in Process Mining
Elkhawaga, Ghada, Abuelkheir, Mervat, Barakat, Sherif I., Riad, Alaa M., Reichert, Manfred
Business processes evolve over time to adapt to changing business environments. This requires continuous monitoring of business processes to gain insights into whether they conform to the intended design or deviate from it. The situation when a business process changes while being analysed is denoted as Concept Drift. Its analysis is concerned with studying how a business process changes, in terms of detecting and localising changes and studying the effects of the latter. Concept drift analysis is crucial to enable early detection and management of changes, that is, whether to promote a change to become part of an improved process, or to reject the change and make decisions to mitigate its effects. Despite its importance, there exists no comprehensive framework for analysing concept drift types, affected process perspectives, and granularity levels of a business process. This article proposes the CONcept Drift Analysis in Process Mining (CONDA-PM) framework describing phases and requirements of a concept drift analysis approach. CONDA-PM was derived from a Systematic Literature Review (SLR) of current approaches analysing concept drift. We apply the CONDA-PM framework on current approaches to concept drift analysis and evaluate their maturity. Applying CONDA-PM framework highlights areas where research is needed to complement existing efforts.
Dual-constrained Deep Semi-Supervised Coupled Factorization Network with Enriched Prior
Zhang, Yan, Zhang, Zhao, Wang, Yang, Zhang, Zheng, Zhang, Li, Yan, Shuicheng, Wang, Meng
Nonnegative matrix factorization is usually powerful for learning the "shallow" parts-based representation, but it clearly fails to discover deep hierarchical information within both the basis and representation spaces. In this paper, we technically propose a new enriched prior based Dual-constrained Deep Semi-Supervised Coupled Factorization Network, called DS2CF-Net, for learning the hierarchical coupled representations. To ex-tract hidden deep features, DS2CF-Net is modeled as a deep-structure and geometrical structure-constrained neural network. Specifically, DS2CF-Net designs a deep coupled factorization architecture using multi-layers of linear transformations, which coupled updates the bases and new representations in each layer. To improve the discriminating ability of learned deep representations and deep coefficients, our network clearly considers enriching the supervised prior by the joint deep coefficients-regularized label prediction, and incorporates enriched prior information as additional label and structure constraints. The label constraint can enable the samples of the same label to have the same coordinate in the new feature space, while the structure constraint forces the coefficient matrices in each layer to be block-diagonal so that the enhanced prior using the self-expressive label propagation are more accurate. Our network also integrates the adaptive dual-graph learning to retain the local manifold structures of both the data manifold and feature manifold by minimizing the reconstruction errors in each layer. Extensive experiments on several real databases demonstrate that our DS2CF-Net can obtain state-of-the-art performance for representation learning and clustering.
Topology-based Clusterwise Regression for User Segmentation and Demand Forecasting
Rivera-Castro, Rodrigo, Pletnev, Aleksandr, Pilyugina, Polina, Diaz, Grecia, Nazarov, Ivan, Zhu, Wanyi, Burnaev, Evgeny
Topological Data Analysis (TDA) is a recent approach to analyze data sets from the perspective of their topological structure. Its use for time series data has been limited. In this work, a system developed for a leading provider of cloud computing combining both user segmentation and demand forecasting is presented. It consists of a TDA-based clustering method for time series inspired by a popular managerial framework for customer segmentation and extended to the case of clusterwise regression using matrix factorization methods to forecast demand. Increasing customer loyalty and producing accurate forecasts remain active topics of discussion both for researchers and managers. Using a public and a novel proprietary data set of commercial data, this research shows that the proposed system enables analysts to both cluster their user base and plan demand at a granular level with significantly higher accuracy than a state of the art baseline. This work thus seeks to introduce TDA-based clustering of time series and clusterwise regression with matrix factorization methods as viable tools for the practitioner.
Top 5 Machine Learning Algorithms used by Data Scientists with Python: Part-1
Machine learning is an important Artificial Intelligence technique that can perform a task effectively by learning through experience. According to Forbes, machine learning will replace 25% of the jobs within the next 10 years. One of the most popular real-world applications of Machine Learning is classification. It corresponds to a task that occurs commonly in everyday life. For example, a hospital may want to classify medical patients into those who are at high, medium or low risk of acquiring a certain illness, an opinion polling company may wish to classify people interviewed into those who are likely to vote for each of several political parties or are undecided, or we may wish to classify a student project as distinction, merit, pass or fail.
GraphKKE: Graph Kernel Koopman Embedding for Human Microbiome Analysis
Melnyk, Kateryna, Klus, Stefan, Montavon, Grรฉgoire, Conrad, Tim
More and more diseases have been found to be strongly correlated with disturbances in the microbiome constitution, e.g., obesity, diabetes, or some cancer types. Thanks to modern high-throughput omics technologies, it becomes possible to directly analyze human microbiome and its influence on the health status. Microbial communities are monitored over long periods of time and the associations between their members are explored. These relationships can be described by a time-evolving graph. In order to understand responses of the microbial community members to a distinct range of perturbations such as antibiotics exposure or diseases and general dynamical properties, the time-evolving graph of the human microbial communities has to be analyzed. This becomes especially challenging due to dozens of complex interactions among microbes and metastable dynamics. The key to solving this problem is the representation of the time-evolving graphs as fixed-length feature vectors preserving the original dynamics. We propose a method for learning the embedding of the time-evolving graph that is based on the spectral analysis of transfer operators and graph kernels. We demonstrate that our method can capture temporary changes in the time-evolving graph on both created synthetic data and real-world data. Our experiments demonstrate the efficacy of the method. Furthermore, we show that our method can be applied to human microbiome data to study dynamic processes.
Learning Inter- and Intra-manifolds for Matrix Factorization-based Multi-Aspect Data Clustering
Abstract--Clustering on the data with multiple aspects, such as multi-view or multi-type relational data, has become popular in recent years due to their wide applicability. The approach using manifold learning with the Nonnegative Matrix Factorization (NMF) framework, that learns the accurate low-rank representation of the multidimensional data, has shown effectiveness. We propose to include the inter-manifold in the NMF framework, utilizing the distance information of data points of different data types (or views) to learn the diverse manifold for data clustering. Empirical analysis reveals that the proposed method can find partial representations of various interrelated types and select useful features during clustering. Results on several datasets demonstrate that the proposed method outperforms the state-of-the-art multi-aspect data clustering methods in both accuracy and efficiency. This can be (1) multi-view data where samples For instance, in Figure 1.a, three intra-type relationship are represented by multiple views; or (2) multi-type matrices will store intra-similarities between Webpages, relational data (MTRD) where samples are represented by between Terms and between Hyperlinks, and three interrelationships different data types and their inherent relationships.
Gradient-based Competitive Learning: Theory
Cirrincione, Giansalvo, Barbiero, Pietro, Ciravegna, Gabriele, Randazzo, Vincenzo
Deep learning has been widely used for supervised learning and classification/regression problems. Recently, a novel area of research has applied this paradigm to unsupervised tasks; indeed, a gradient-based approach extracts, efficiently and autonomously, the relevant features for handling input data. However, state-of-the-art techniques focus mostly on algorithmic efficiency and accuracy rather than mimic the input manifold. On the contrary, competitive learning is a powerful tool for replicating the input distribution topology. This paper introduces a novel perspective in this area by combining these two techniques: unsupervised gradient-based and competitive learning. The theory is based on the intuition that neural networks are able to learn topological structures by working directly on the transpose of the input matrix. At this purpose, the vanilla competitive layer and its dual are presented. The former is just an adaptation of a standard competitive layer for deep clustering, while the latter is trained on the transposed matrix. Their equivalence is extensively proven both theoretically and experimentally. However, the dual layer is better suited for handling very high-dimensional datasets. The proposed approach has a great potential as it can be generalized to a vast selection of topological learning tasks, such as non-stationary and hierarchical clustering; furthermore, it can also be integrated within more complex architectures such as autoencoders and generative adversarial networks.
Principal Ellipsoid Analysis (PEA): Efficient non-linear dimension reduction & clustering
Paul, Debolina, Chakraborty, Saptarshi, Li, Didong, Dunson, David
Clustering of data into groups of relatively similar observations is one of the canonical tasks in unsupervised learning. With an increasing focus in recent years on very richly parameterized models, there has been a corresponding emphasis in the literature on complex clustering algorithms. A popular theme has been on clustering on the latent variable level, while allowing estimation of both the clustering structure and a complex nonlinear mapping from the latent to observed data level. Such methods are appealing in being able to realistically generate data that are indistinguishable from the observed data, while clustering observations in a lower-dimensional space. A particularly popular strategy is to develop clustering algorithms based on variational autoencoders (VAEs). For example, instead of drawing the latent variables in a VAE from standard Gaussian distributions, one can use a mixture of Gaussians for model-based clustering (Dilokthanakul et al., 2016; Lim et al., 2020; Yang et al., 2019). The problem with this family of methods is that, with a rich enough deep neural network, VAEs can accurately approximate any data generating distribution regardless of the continuous density placed on the latent variables. If one uses a richer family of densities, such as a mixture model, then one can potentially approximate the data distribution using a simpler neural network structure. However, the inferred clusters are not reliable due to problems of non-identifiability.