Clustering
Outcome-Oriented Predictive Process Monitoring: Review and Benchmark
Teinemaa, Irene, Dumas, Marlon, La Rosa, Marcello, Maggi, Fabrizio Maria
Traditional process monitoring techniques provide dashboards and reports showing the recent performance of a business process in terms of key performance indicators such as mean execution time, resource utilization or error rate with respect to a given notion of error. Predictive (business) process monitoring techniques go beyond traditional ones by making predictions about the future state of the executions of a business process (herein called cases). For example, a predictive monitoring technique may seek to predict the remaining execution time of each ongoing case of a process [29], the next activity that will be executed in each case [11], or the final outcome of a case, with respect to a possible set of business outcomes [23-25]. For instance, in an order-to-cash process (a process going from the receipt of a purchase order to the receipt of payment of the corresponding invoice), the possible outcomes of a case may be that the purchase order is closed satisfactorily (i.e., the customer accepted the products and paid) or unsatisfactorily (e.g., the order was canceled or withdrawn). Another set of possible outcomes is that the products were delivered on time (with respect to a maximum acceptable delivery time), or delivered late. Recent years have seen the emergence of a rich field of proposed methods for predictive process monitoring in general, and predictive monitoring of (categorical) case outcomes in particular - herein called outcome-oriented predictive process monitoring. Unfortunately, there is no unified approach to evaluate these methods. Indeed, different authors have used different datasets, experimental settings, evaluation measures and baselines.
Overlapping Clustering Models, and One (class) SVM to Bind Them All
Mao, Xueyu, Sarkar, Purnamrita, Chakrabarti, Deepayan
People belong to multiple communities, words belong to multiple topics, and books cover multiple genres; overlapping clusters are commonplace. Many existing overlapping clustering methods model each person (or word, or book) as a non-negative weighted combination of "exemplars" who belong solely to one community, with some small noise. Geometrically, each person is a point on a cone whose corners are these exemplars. This basic form encompasses the widely used Mixed Membership Stochastic Blockmodel of networks (Airoldi et al., 2008) and its degree-corrected variants (Karrer et al. 2011; Jin et al., 2017), as well as topic models such as LDA (Blei et al., 2003). We show that a simple one-class SVM yields provably consistent parameter inference for all such models, and scales to large datasets. Experimental results on several simulated and real datasets show our algorithm (called SVM-cone) is both accurate and scalable.
Task-Relevant Object Discovery and Categorization for Playing First-person Shooter Games
Liang, Junchi, Boularias, Abdeslam
We consider the problem of learning to play first-person shooter (FPS) video games using raw screen images as observations and keyboard inputs as actions. The high-dimensionality of the observations in this type of applications leads to prohibitive needs of training data for model-free methods, such as the deep Q-network (DQN), and its recurrent variant DRQN. Thus, recent works focused on learning low-dimensional representations that may reduce the need for data. This paper presents a new and efficient method for learning such representations. Salient segments of consecutive frames are detected from their optical flow, and clustered based on their feature descriptors. The clusters typically correspond to different discovered categories of objects. Segments detected in new frames are then classified based on their nearest clusters. Because only a few categories are relevant to a given task, the importance of a category is defined as the correlation between its occurrence and the agent's performance. The result is encoded as a vector indicating objects that are in the frame and their locations, and used as a side input to DRQN. Experiments on the game Doom provide a good evidence for the benefit of this approach.
Learning Treatment Regimens from Electronic Medical Records
Appropriate treatment regimens play a vital role in improving patient health status. Although some achievements have been made, few of the recent studies of learning treatment regimens have exploited different kinds of patient information due to the difficulty in adopting heterogeneous data to many data mining methods. Moreover, current studies seem too rigid with fixed intervals of treatment periods corresponding to the varying lengths of hospital stay. To this end, this work proposes a generic data-driven framework which can derive group-treatment regimens from electronic medical records by utilizing a mixed-variate restricted Boltzmann machine and incorporating medical domain knowledge. We conducted experiments on coronary artery disease as a case study. The obtained results show that the framework is promising and capable of assisting physicians in making clinical decisions.
Playing with dimensions: from Clustering, PCA, t-SNE... to Carl Sagan!
This post is an experiment combining the result of t-SNE with two well known clustering techniques: k-means and hierarchical. This will be the practical section, in R. But also, this post will explore the intersection point of concepts like dimension reduction, clustering analysis, data preparation, PCA, HDBSCAN, k-NN, SOM, deep learning...and Carl Sagan! For those who don't know t-SNE technique (official site), it's a projection technique -or dimension reduction- similar in some aspects to Principal Component Analysis (PCA), used to visualize N variables into 2 (for example). When the t-SNE output is poor Laurens van der Maaten (t-SNE's author) says: As a sanity check, try running PCA on your data to reduce it to two dimensions.
Morse Theory and an Impossibility Theorem for Graph Clustering
Strazzeri, Fabio, Sánchez-García, Rubén J.
Kleinberg introduced three natural clustering properties, or axioms, and showed they cannot be simultaneously satisfied by any clustering algorithm. We present a new clustering property, Monotonic Consistency, which avoids the well-known problematic behaviour of Kleinberg's Consistency axiom, and the impossibility result. Namely, we describe a clustering algorithm, Morse Clustering, inspired by Morse Theory in Differential Topology, which satisfies Kleinberg's original axioms with Consistency replaced by Monotonic Consistency. Morse clustering uncovers the underlying flow structure on a set or graph and returns a partition into trees representing basins of attraction of critical vertices. We also generalise Kleinberg's axiomatic approach to sparse graphs, showing an impossibility result for Consistency, and a possibility result for Monotonic Consistency and Morse clustering.
Query K-means Clustering and the Double Dixie Cup Problem
Chien, I, Pan, Chao, Milenkovic, Olgica
We consider the problem of approximate $K$-means clustering with outliers and side information provided by same-cluster queries and possibly noisy answers. Our solution shows that, under some mild assumptions on the smallest cluster size, one can obtain an $(1+\epsilon)$-approximation for the optimal potential with probability at least $1-\delta$, where $\epsilon>0$ and $\delta\in(0,1)$, using an expected number of $O(\frac{K^3}{\epsilon \delta})$ noiseless same-cluster queries and comparison-based clustering of complexity $O(ndK + \frac{K^3}{\epsilon \delta})$, here, $n$ denotes the number of points and $d$ the dimension of space. Compared to a handful of other known approaches that perform importance sampling to account for small cluster sizes, the proposed query technique reduces the number of queries by a factor of roughly $O(\frac{K^6}{\epsilon^3})$, at the cost of possibly missing very small clusters. We extend this settings to the case where some queries to the oracle produce erroneous information, and where certain points, termed outliers, do not belong to any clusters. Our proof techniques differ from previous methods used for $K$-means clustering analysis, as they rely on estimating the sizes of the clusters and the number of points needed for accurate centroid estimation and subsequent nontrivial generalizations of the double Dixie cup problem. We illustrate the performance of the proposed algorithm both on synthetic and real datasets, including MNIST and CIFAR $10$.
Improved Density-Based Spatio--Textual Clustering on Social Media
Nguyen, Minh D., Shin, Won-Yong
DBSCAN may not be sufficient when the input data type is heterogeneous in terms of textual description. When we aim to discover clusters of geo-tagged records relevant to a particular point-of-interest (POI) on social media, examining only one type of input data (e.g., the tweets relevant to a POI) may draw an incomplete picture of clusters due to noisy regions. To overcome this problem, we introduce DBSTexC, a newly defined density-based clustering algorithm using spatio--textual information. We first characterize POI-relevant and POI-irrelevant tweets as the texts that include and do not include a POI name or its semantically coherent variations, respectively. By leveraging the proportion of POI-relevant and POI-irrelevant tweets, the proposed algorithm demonstrates much higher clustering performance than the DBSCAN case in terms of $\mathcal{F}_1$ score and its variants. While DBSTexC performs exactly as DBSCAN with the textually homogeneous inputs, it far outperforms DBSCAN with the textually heterogeneous inputs. Furthermore, to further improve the clustering quality by fully capturing the geographic distribution of tweets, we present fuzzy DBSTexC (F-DBSTexC), an extension of DBSTexC, which incorporates the notion of fuzzy clustering into the DBSTexC. We then demonstrate the robustness of F-DBSTexC via intensive experiments. The computational complexity of our algorithms is also analytically and numerically shown.
Pattern Dependence Detection using n-TARP Clustering
Yellamraju, Tarun, Boutin, Mireille
Consider an experiment involving a potentially small number of subjects. Some random variables are observed on each subject: a high-dimensional one called the "observed" random variable, and a one-dimensional one called the "outcome" random variable. We are interested in the dependencies between the observed random variable and the outcome random variable. We propose a method to quantify and validate the dependencies of the outcome random variable on the various patterns contained in the observed random variable. Different degrees of relationship are explored (linear, quadratic, cubic, ...). This work is motivated by the need to analyze educational data, which often involves high-dimensional data representing a small number of students. Thus our implementation is designed for a small number of subjects; however, it can be easily modified to handle a very large dataset. As an illustration, the proposed method is used to study the influence of certain skills on the course grade of students in a signal processing class. A valid dependency of the grade on the different skill patterns is observed in the data.
High-Dimensional Inference for Cluster-Based Graphical Models
Eisenach, Carson, Bunea, Florentina, Ning, Yang, Dinicu, Claudiu
Motivated by modern applications in which one constructs graphical models based on a very large number of features, this paper introduces a new class of cluster-based graphical models. Unlike standard graphical models, variable clustering is applied as an initial step for reducing the dimension of the feature space. We employ model assisted clustering, in which the clusters contain features that are similar to the same unobserved latent variable. Two different cluster-based Gaussian graphical models are considered: the latent variable graph, corresponding to the graphical model associated with the unobserved latent variables, and the cluster-average graph, corresponding to the vector of features averaged over clusters. We derive estimates tailored to these graphs, with the goal of pattern recovery under false discovery rate (FDR) control. Our study reveals that likelihood based inference for the latent graph is analytically intractable, and we develop alternative estimation and inference strategies. We replace the likelihood of the data by appropriate empirical risk functions that allow for valid inference in both graphical models under study. Our main results are Berry-Esseen central limit theorems for the proposed estimators, which are proved under weaker assumptions than those employed in the existing literature on Gaussian graphical model inference. We make explicit the implications of the asymptotic approximations on graph recovery under FDR control, and show when it can be controlled asymptotically. Our analysis takes into account the uncertainty induced by the initial clustering step. We find that the errors induced by clustering are asymptotically ignorable in the follow-up analysis, under no further restrictions on the parameter space for which inference is valid. The theoretical properties of the proposed procedures are verified on simulated data and an fMRI data analysis.