Clustering
Detecting organized eCommerce fraud using scalable categorical clustering
Marchal, Samuel, Szyller, Sebastian
Online retail, eCommerce, frequently falls victim to fraud conducted by malicious customers (fraudsters) who obtain goods or services through deception. Fraud coordinated by groups of professional fraudsters that place several fraudulent orders to maximize their gain is referred to as organized fraud. Existing approaches to fraud detection typically analyze orders in isolation and are not effective at identifying groups of fraudulent orders linked to organized fraud. They also wrongly identify many legitimate orders as fraud, which hinders their use for automated fraud cancellation. We introduce a novel solution to detect organized fraud by analyzing orders in bulk. Our approach is based on clustering and aims to group together fraudulent orders placed by the same group of fraudsters. It selectively uses two existing techniques, agglomerative clustering and sampling, to recursively group orders into small clusters in a reasonable amount of time. We assess our clustering technique on real-world orders placed on the Zalando website, the largest online apparel retailer in Europe. Our clustering processes 100,000s of orders in a few hours and groups 35-45% of fraudulent orders together. We propose a simple technique built on top of our clustering that detects 26.2% of fraud while raising false alarms for only 0.1% of legitimate orders.
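To make the agglomerative step concrete, here is a minimal sketch of bottom-up clustering over categorical records. It is not the authors' implementation: the Hamming distance, the single-linkage rule, the threshold value, and the toy order attributes are all assumptions chosen for illustration, and the sampling and recursion that the abstract relies on for scalability are not reproduced.

```python
from itertools import combinations

def hamming(a, b):
    """Fraction of categorical attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def agglomerative(records, threshold):
    """Single-linkage agglomerative clustering (illustrative assumption).

    Start with one cluster per record and repeatedly merge the two closest
    clusters until the closest pair is farther apart than `threshold`.
    Returns clusters as lists of record indices.
    """
    clusters = [[i] for i in range(len(records))]
    while len(clusters) > 1:
        best = None
        for (ia, ca), (ib, cb) in combinations(enumerate(clusters), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(hamming(records[i], records[j]) for i in ca for j in cb)
            if best is None or d < best[0]:
                best = (d, ia, ib)
        d, ia, ib = best
        if d > threshold:
            break
        clusters[ia].extend(clusters[ib])
        del clusters[ib]  # ia < ib, so index ia is unaffected
    return clusters

# Hypothetical orders: (country, payment method, item category).
orders = [
    ("DE", "visa", "shoes"),
    ("DE", "visa", "shirt"),
    ("FR", "paypal", "dress"),
    ("FR", "paypal", "dress"),
    ("IT", "invoice", "coat"),
]
clusters = agglomerative(orders, threshold=0.34)
```

The pairwise search makes this sketch far too slow for 100,000s of orders, which is presumably why the paper combines clustering with sampling.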
Gaussian Mixture Clustering Using Relative Tests of Fit
Chakravarti, Purvasha, Balakrishnan, Sivaraman, Wasserman, Larry
We consider clustering based on significance tests for Gaussian Mixture Models (GMMs). Our starting point is the SigClust method developed by Liu et al. (2008), which introduces a test based on the k-means objective (with k = 2) to decide whether the data should be split into two clusters. When applied recursively, this test yields a method for hierarchical clustering that is equipped with a significance guarantee. We study the limiting distribution and power of this approach in some examples and show that there are large regions of the parameter space where the power is low. We then introduce a new test based on the idea of relative fit. Unlike prior work, we test for whether a mixture of Gaussians provides a better fit relative to a single Gaussian, without assuming that either model is correct. The proposed test has a simple critical value and provides provable error control. One version of our test provides exact, finite sample control of the type I error. We show how our tests can be used for hierarchical clustering as well as in a sequential manner for model selection. We conclude with an extensive simulation study and a cluster analysis of a gene expression dataset.
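As a rough illustration of a statistic built on the k-means objective with k = 2, the sketch below computes a 2-means cluster index: the within-cluster sum of squares of the best two-cluster split divided by the total sum of squares. Treating this ratio as the test statistic is an assumption not stated in the abstract, the function names are hypothetical, and the code is restricted to one-dimensional data, where the optimal 2-means split is always a cut point in sorted order. It does not compute a p-value or implement the relative-fit test.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def two_means_cluster_index(values):
    """Optimal 2-means within-cluster sum of squares over total sum of squares.

    Assumes one-dimensional data with at least two distinct values, so the
    total sum of squares is non-zero. Values near 0 suggest two tight,
    well-separated groups; larger values suggest weaker separation.
    """
    xs = sorted(values)
    total = sum_sq(xs)
    # In 1D the optimal partition is contiguous, so try every cut point.
    best_within = min(sum_sq(xs[:k]) + sum_sq(xs[k:]) for k in range(1, len(xs)))
    return best_within / total

separated = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
evenly_spaced = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ci_separated = two_means_cluster_index(separated)
ci_even = two_means_cluster_index(evenly_spaced)
```

A significance test would compare such a value against its distribution under a single-Gaussian null, which this sketch leaves out.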
Weighted Clustering Ensemble: A Review
Clustering ensemble has emerged as a powerful tool for improving both the robustness and the stability of results from individual clustering methods. Weighted clustering ensemble arises naturally from clustering ensemble. One of the arguments for weighted clustering ensemble is that elements (clusterings or clusters) in a clustering ensemble are of different quality, or that objects or features are of varying significance. However, it is not possible to directly apply the weighting mechanisms from the classification (supervised) domain to the clustering (unsupervised) domain, in part because clustering is inherently an ill-posed problem. This paper provides an overview of weighted clustering ensemble by discussing different types of weights, major approaches to determining weight values, and applications of weighted clustering ensemble to complex data. The unifying framework presented in this paper will help clustering practitioners select the most appropriate weighting mechanisms for their own problems.
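One way to picture weights on whole clusterings is a weighted co-association matrix, in which each base clustering votes on whether two objects belong together and the votes are scaled by that clustering's weight. The sketch below is an illustrative assumption rather than a method taken from the review: the function name, the example labelings, and the weight values are hypothetical, and weights on clusters, objects, or features are not covered.

```python
def weighted_coassociation(labelings, weights):
    """Weighted fraction of base clusterings that put objects i and j together.

    labelings: list of label lists, one per base clustering, all same length.
    weights:   one non-negative weight per base clustering (hypothetical
               quality scores); they are normalized to sum to 1.
    """
    n = len(labelings[0])
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(labelings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w / total
    return matrix

# Three base clusterings of four objects; the first is trusted twice as much.
labelings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
weights = [2.0, 1.0, 1.0]
co = weighted_coassociation(labelings, weights)
```

A consensus partition could then be obtained by clustering the objects with `1 - co[i][j]` as a dissimilarity, for example with an agglomerative method.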
Clustering Gaussian Graphical Models
We derive an efficient method to perform clustering of nodes in Gaussian graphical models directly from sample data. Nodes are clustered based on the similarity of their network neighborhoods, with edge weights defined by partial correlations. In the limited-data scenario, where the covariance matrix would be rank-deficient, we are able to make use of matrix factors, and never need to estimate the actual covariance or precision matrix. We demonstrate the method on functional MRI data from the Human Connectome Project. A MATLAB implementation of the algorithm is provided.
A Novel Graphical Lasso based approach towards Segmentation Analysis in Energy Game-Theoretic Frameworks
Das, Hari Prasanna, Konstantakopoulos, Ioannis C., Manasawala, Aummul Baneen, Veeravalli, Tanya, Liu, Huihan, Spanos, Costas J.
Energy game-theoretic frameworks have emerged as a successful strategy for encouraging energy-efficient behavior at scale by keeping humans in the loop. A number of such frameworks have been introduced over the years, which formulate the energy saving process as a competitive game with appropriate incentives for energy-efficient players. However, prior works involve an incentive design mechanism that depends on knowledge of the utility functions of all the players in the game, which is hard to compute, especially when the number of players is high, as is common in energy game-theoretic frameworks. Our research proposes that the utilities of players in such a framework can be grouped into a relatively small number of clusters, and the clusters can then be targeted with tailored incentives. The key to the above segmentation analysis is to learn the features leading to human decision making towards energy usage in competitive environments. We propose a novel graphical lasso based approach to perform such segmentation, by studying the feature correlations in a real-world energy social game dataset. To further improve the explainability of the model, we perform a causality study using Granger causality. The proposed segmentation analysis results in characteristic clusters demonstrating different energy usage behaviors. We also present avenues to implement intelligent incentive design using the proposed segmentation method.
Randomized Shortest Paths with Net Flows and Capacity Constraints
Courtain, Sylvain, Leleux, Pierre, Kivimaki, Ilkka, Guex, Guillaume, Saerens, Marco
This work extends the randomized shortest paths model (RSP) by investigating the net flow RSP and adding capacity constraints on edge flows. The standard RSP is a model of movement, or spread, through a network interpolating between a random walk and a shortest path behavior. This framework assumes a unit flow injected into a source node and collected from a target node with flows minimizing the expected transportation cost together with a relative entropy regularization term. In this context, the present work first develops the net flow RSP model considering that edge flows in opposite directions neutralize each other (as in electrical networks) and proposes an algorithm for computing the expected routing costs between all pairs of nodes. This quantity is called the net flow RSP dissimilarity measure between nodes. Experimental comparisons on node clustering tasks show that the net flow RSP dissimilarity is competitive with other state-of-the-art techniques. In the second part of the paper, it is shown how to introduce capacity constraints on edge flows, and a procedure for solving this constrained problem using Lagrangian duality is developed. These two extensions significantly extend the scope of applications of the RSP framework.
Targeted sampling from massive Blockmodel graphs with personalized PageRank
Chen, Fan, Zhang, Yini, Rohe, Karl
This paper provides statistical theory and intuition for Personalized PageRank (PPR), a popular technique that samples a small community from a massive network. We study a setting where the entire network is expensive to thoroughly obtain or maintain, but we can start from a seed node of interest and "crawl" the network to find other nodes through their connections. By crawling the graph in a designed way, the PPR vector can be approximated without querying the entire massive graph, making it an alternative to snowball sampling. Using the Degree-Corrected Stochastic Blockmodel, we study whether the PPR vector can select nodes that belong to the same block as the seed node. We provide a simple and interpretable form for the PPR vector, highlighting its biases towards high degree nodes outside of the target block. We examine a simple adjustment based on node degrees and establish consistency results for PPR clustering that allows for directed graphs. We illustrate the method with the Twitter friendship graph and find that (i) the adjusted and unadjusted PPR techniques are complementary approaches, where the adjustment makes the results particularly localized around the seed node and (ii) the bias adjustment greatly benefits from degree regularization.
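The Personalized PageRank vector and a degree-based adjustment can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it runs plain power iteration over the whole graph instead of the local, crawl-based approximation the abstract describes, it assumes the adjustment is division of each score by the node's degree (the abstract only says the adjustment is "based on node degrees"), and the toy graph, function names, and teleportation constant `alpha` are hypothetical.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=200):
    """Personalized PageRank by power iteration.

    adj:   dict mapping each node to a list of its neighbors (undirected
           graph, every node has at least one neighbor).
    seed:  node to which the random walk teleports with probability alpha.
    """
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += alpha  # teleportation mass returns to the seed
        for u in adj:
            share = (1 - alpha) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share  # walk one step along each edge
        p = nxt
    return p

def degree_adjusted(p, adj):
    """Assumed adjustment: divide each PPR score by the node's degree."""
    return {v: p[v] / len(adj[v]) for v in p}

# Two triangles (blocks A and B) joined by a single bridge edge a3-b1.
graph = {
    "a1": ["a2", "a3"],
    "a2": ["a1", "a3"],
    "a3": ["a1", "a2", "b1"],
    "b1": ["a3", "b2", "b3"],
    "b2": ["b1", "b3"],
    "b3": ["b1", "b2"],
}
ppr = personalized_pagerank(graph, seed="a1")
adjusted = degree_adjusted(ppr, graph)
block_a = ["a1", "a2", "a3"]
block_b = ["b1", "b2", "b3"]
```

Ranking nodes by `adjusted` and keeping the top entries gives a candidate community around the seed; on this toy graph every node in the seed's triangle scores above every node in the other triangle.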
Sparse Popularity Adjusted Stochastic Block Model
Noroozi, Majid, Rimal, Ramchandra, Pensky, Marianna
The objective of the present paper is to study the Popularity Adjusted Block Model (PABM) in the sparse setting. Unlike in other block models, the flexibility of PABM allows some of the connection probabilities to be set to zero while keeping the rest of the probabilities non-negligible, leading to the Sparse Popularity Adjusted Block Model (SPABM). The latter reduces the size of the parameter set and leads to improved precision of estimation and clustering. The theory is complemented by a simulation study and real data examples.
CS Sparse K-means: An Algorithm for Cluster-Specific Feature Selection in High-Dimensional Clustering
Feature selection is an important and challenging task in high-dimensional clustering. For example, in genomics, there may only be a small number of differentially expressed genes that are informative about the overall clustering structure. Existing feature selection methods, such as Sparse K-means, rarely tackle the problem of accounting for features that can only separate a subset of clusters. In genomics, it is highly likely that a gene can only define one subtype against all the other subtypes or distinguish a pair of subtypes but not others. In this paper, we propose a K-means based clustering algorithm that discovers informative features as well as which cluster pairs are separable by each selected feature. The method is essentially an EM algorithm, in which we introduce lasso-type constraints on each cluster pair in the M step, and make the E step possible by maximizing the raw cross-cluster distance instead of minimizing the intra-cluster distance. The results were demonstrated on simulated data and a leukemia gene expression dataset.
Learn Types of Machine Learning Algorithms with Ultimate Use Cases - DataFlair
In this article, we will study the various types of machine learning algorithms and their use cases. We will study how Baidu is using supervised learning-based facial recognition for intelligent airport check-in and how Google is making use of reinforcement learning to develop an intelligent platform that would answer your queries. Machine learning is a broad field, but it is commonly divided into three classes: supervised, unsupervised, and reinforcement learning. All three paradigms are widely used to power intelligent applications. We will look at the important use cases of these paradigms and how they are revolutionizing our world today.