AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Detecting organized eCommerce fraud using scalable categorical clustering

Marchal, Samuel, Szyller, Sebastian

arXiv.org Machine Learning, Oct-10-2019

Online retail (eCommerce) frequently falls victim to fraud conducted by malicious customers (fraudsters) who obtain goods or services through deception. Fraud coordinated by groups of professional fraudsters who place several fraudulent orders to maximize their gain is referred to as organized fraud. Existing approaches to fraud detection typically analyze orders in isolation and are not effective at identifying groups of fraudulent orders linked to organized fraud. They also wrongly identify many legitimate orders as fraud, which hinders their use for automated fraud cancellation. We introduce a novel solution to detect organized fraud by analyzing orders in bulk. Our approach is based on clustering and aims to group together fraudulent orders placed by the same group of fraudsters. It selectively uses two existing techniques, agglomerative clustering and sampling, to recursively group orders into small clusters in a reasonable amount of time. We assess our clustering technique on real-world orders placed on the Zalando website, the largest online apparel retailer in Europe. Our clustering processes hundreds of thousands of orders in a few hours and groups 35-45% of fraudulent orders together. We propose a simple technique built on top of our clustering that detects 26.2% of fraud while raising false alarms for only 0.1% of legitimate orders.

fraud, legitimate order, recagglo, (17 more...)

arXiv:1910.04514

Country:

Europe > France (0.04)
Europe > Switzerland (0.04)
Europe > Germany (0.04)
(4 more...)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry:

Retail (1.00)
Law Enforcement & Public Safety > Fraud (1.00)
Information Technology > Services > e-Commerce Services (0.61)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
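
The abstract of this entry describes grouping orders with agglomerative clustering. As a rough illustration only, the sketch below runs plain single-linkage agglomerative clustering over categorical records with a Hamming distance and a merge cut-off. It is not the authors' method: the sampling and recursion steps are omitted, and the order attributes, the `max_dist` threshold, and the function names are all hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of attribute positions where two records differ."""
    return sum(x != y for x, y in zip(a, b))


def agglomerate(records, max_dist):
    """Single-linkage agglomerative clustering with a distance cut-off.

    Starts with one cluster per record and repeatedly merges the two
    closest clusters until no pair is within max_dist.
    Returns a list of clusters, each a sorted list of record indices.
    """
    clusters = [[i] for i in range(len(records))]

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(hamming(records[i], records[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        (a, b), dist = min(
            (((a, b), linkage(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > max_dist:
            break
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical orders: (shipping city, payment method, email domain).
orders = [
    ("Berlin", "invoice", "mail.example"),
    ("Berlin", "invoice", "post.example"),
    ("Paris", "card", "web.example"),
    ("Berlin", "invoice", "mail.example"),
    ("Paris", "card", "net.example"),
]
groups = agglomerate(orders, max_dist=1)
```

Each merge step here compares every pair of clusters, so the cost grows quickly with the number of records; that cost is one plausible reason to combine agglomerative clustering with sampling on large order sets, as the abstract mentions.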

Gaussian Mixture Clustering Using Relative Tests of Fit

Chakravarti, Purvasha, Balakrishnan, Sivaraman, Wasserman, Larry

arXiv.org Machine Learning, Oct-6-2019

We consider clustering based on significance tests for Gaussian Mixture Models (GMMs). Our starting point is the SigClust method developed by Liu et al. (2008), which introduces a test based on the k-means objective (with k = 2) to decide whether the data should be split into two clusters. When applied recursively, this test yields a method for hierarchical clustering that is equipped with a significance guarantee. We study the limiting distribution and power of this approach in some examples and show that there are large regions of the parameter space where the power is low. We then introduce a new test based on the idea of relative fit. Unlike prior work, we test for whether a mixture of Gaussians provides a better fit relative to a single Gaussian, without assuming that either model is correct. The proposed test has a simple critical value and provides provable error control. One version of our test provides exact, finite sample control of the type I error. We show how our tests can be used for hierarchical clustering as well as in a sequential manner for model selection. We conclude with an extensive simulation study and a cluster analysis of a gene expression dataset.

ab 1 2, equation, sigclust, (16 more...)

arXiv:1910.02566

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Experimental Study (0.87)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
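
The abstract of this entry mentions a test based on the k-means objective with k = 2. The sketch below computes only that kind of statistic for one-dimensional data: the best 2-means split and the ratio of within-cluster to total sum of squares (often called a cluster index; smaller values suggest stronger two-cluster structure). It does not implement a significance test or the paper's relative-fit test, and the function names and sample data are assumptions made for illustration.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def two_means_index(data):
    """Best 2-means split of 1-D data and its cluster index.

    In one dimension the optimal 2-means clusters are contiguous in sorted
    order, so trying every split point finds the exact optimum.
    Requires at least two points that are not all equal.
    Returns (index, left, right), where index = within-SS / total-SS.
    """
    xs = sorted(data)
    total = sum_sq(xs)
    within, k = min(
        (sum_sq(xs[:k]) + sum_sq(xs[k:]), k) for k in range(1, len(xs))
    )
    return within / total, xs[:k], xs[k:]


# Hypothetical sample with two well-separated groups.
separated = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
index, left, right = two_means_index(separated)
```

A formal test would compare this index against its distribution under a single-Gaussian null, which is the part this sketch leaves out.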

Weighted Clustering Ensemble: A Review

Zhang, Mimi

arXiv.org Machine Learning, Oct-6-2019

Clustering ensemble has emerged as a powerful tool for improving both the robustness and the stability of results from individual clustering methods. Weighted clustering ensemble arises naturally from clustering ensemble. One of the arguments for weighted clustering ensemble is that elements (clusterings or clusters) in a clustering ensemble are of different quality, or that objects or features are of varying significance. However, it is not possible to directly apply the weighting mechanisms from the classification (supervised) domain to the clustering (unsupervised) domain, in part because clustering is inherently an ill-posed problem. This paper provides an overview of weighted clustering ensemble by discussing different types of weights, major approaches to determining weight values, and applications of weighted clustering ensemble to complex data. The unifying framework presented in this paper will help clustering practitioners select the most appropriate weighting mechanisms for their own problems.

algorithm, consensus, ensemble, (16 more...)

arXiv:1910.02433

Country:

North America > United States (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Overview (0.86)
Research Report (0.63)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
(3 more...)

Clustering Gaussian Graphical Models

Dillon, Keith

arXiv.org Machine Learning, Oct-5-2019

We derive an efficient method to perform clustering of nodes in Gaussian graphical models directly from sample data. Nodes are clustered based on the similarity of their network neighborhoods, with edge weights defined by partial correlations. In the limited-data scenario, where the covariance matrix would be rank-deficient, we are able to make use of matrix factors, and never need to estimate the actual covariance or precision matrix. We demonstrate the method on functional MRI data from the Human Connectome Project. A MATLAB implementation of the algorithm is provided.

correlation, matrix, partial correlation, (16 more...)

arXiv:1910.02342

Country:

Oceania > New Zealand (0.04)
North America > United States > Massachusetts > Middlesex County > Natick (0.04)
North America > United States > Connecticut > New Haven County > West Haven (0.04)

Genre: Research Report (0.40)

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

A Novel Graphical Lasso based approach towards Segmentation Analysis in Energy Game-Theoretic Frameworks

Das, Hari Prasanna, Konstantakopoulos, Ioannis C., Manasawala, Aummul Baneen, Veeravalli, Tanya, Liu, Huihan, Spanos, Costas J.

arXiv.org Machine Learning, Oct-5-2019

Energy game-theoretic frameworks have emerged as a successful strategy for encouraging energy-efficient behavior at large scale by keeping humans in the loop. A number of such frameworks have been introduced over the years that formulate the energy-saving process as a competitive game with appropriate incentives for energy-efficient players. However, prior work relies on an incentive design mechanism that depends on knowledge of the utility functions of all players in the game, which is hard to compute, especially when the number of players is high, as is common in energy game-theoretic frameworks. Our research proposes that the utilities of players in such a framework can be grouped into a relatively small number of clusters, and the clusters can then be targeted with tailored incentives. The key to this segmentation analysis is to learn the features that drive human decision making about energy usage in competitive environments. We propose a novel graphical lasso based approach to perform such segmentation by studying the feature correlations in a real-world energy social game dataset. To further improve the explainability of the model, we perform a causality study using Granger causality. The proposed segmentation analysis results in characteristic clusters demonstrating different energy usage behaviors. We also present avenues for implementing intelligent incentive design using the proposed segmentation method.

correlation, energy efficiency, incentive design, (11 more...)

arXiv:1910.02217

Country:

North America > United States > California > Alameda County > Berkeley (0.05)
Asia > Singapore (0.05)

Genre: Research Report (1.00)

Industry: Energy > Power Industry (1.00)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Randomized Shortest Paths with Net Flows and Capacity Constraints

Courtain, Sylvain, Leleux, Pierre, Kivimaki, Ilkka, Guex, Guillaume, Saerens, Marco

arXiv.org Machine Learning, Oct-4-2019

This work extends the randomized shortest paths model (RSP) by investigating the net flow RSP and adding capacity constraints on edge flows. The standard RSP is a model of movement, or spread, through a network interpolating between a random walk and a shortest path behavior. This framework assumes a unit flow injected into a source node and collected from a target node, with flows minimizing the expected transportation cost together with a relative entropy regularization term. In this context, the present work first develops the net flow RSP model, considering that edge flows in opposite directions neutralize each other (as in electrical networks), and proposes an algorithm for computing the expected routing costs between all pairs of nodes. This quantity is called the net flow RSP dissimilarity measure between nodes. Experimental comparisons on node clustering tasks show that the net flow RSP dissimilarity is competitive with other state-of-the-art techniques. In the second part of the paper, it is shown how to introduce capacity constraints on edge flows, and a procedure that solves this constrained problem using Lagrangian duality is developed. These two extensions significantly broaden the scope of applications of the RSP framework.

capacity constraint, constraint, node, (17 more...)

arXiv:1910.01849

Country:

North America > United States (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Belgium (0.04)

Genre: Research Report (1.00)

Industry: Energy > Power Industry (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Targeted sampling from massive Blockmodel graphs with personalized PageRank

Chen, Fan, Zhang, Yini, Rohe, Karl

arXiv.org Machine Learning, Oct-3-2019

This paper provides statistical theory and intuition for Personalized PageRank (PPR), a popular technique that samples a small community from a massive network. We study a setting where the entire network is expensive to obtain or maintain in full, but we can start from a seed node of interest and "crawl" the network to find other nodes through their connections. By crawling the graph in a designed way, the PPR vector can be approximated without querying the entire massive graph, making it an alternative to snowball sampling. Using the Degree-Corrected Stochastic Blockmodel, we study whether the PPR vector can select nodes that belong to the same block as the seed node. We provide a simple and interpretable form for the PPR vector, highlighting its biases towards high degree nodes outside of the target block. We examine a simple adjustment based on node degrees and establish consistency results for PPR clustering that allow for directed graphs. We illustrate the method with the Twitter friendship graph and find that (i) the adjusted and unadjusted PPR techniques are complementary approaches, where the adjustment makes the results particularly localized around the seed node and (ii) the bias adjustment greatly benefits from degree regularization.

graph, ppr vector, vector, (16 more...)

arXiv:1910.12937

Country:

North America > United States > Wisconsin > Dane County > Madison (0.14)
Asia > Middle East > Iraq > Baghdad Governorate > Baghdad (0.04)
North America > United States > Tennessee (0.04)
(12 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Media > Television (1.00)
Media > News (1.00)
Leisure & Entertainment (1.00)
(5 more...)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science (1.00)
Information Technology > Communications > Social Media (1.00)
(2 more...)
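
The abstract of this entry discusses the Personalized PageRank (PPR) vector and an adjustment based on node degrees. A minimal sketch follows, assuming a small graph held fully in memory, a teleport probability `alpha` of 0.15, and plain power iteration rather than the crawling-based approximation the abstract refers to. The adjustment shown (dividing each PPR score by the node's degree) is an assumed form chosen for illustration, and the toy graph is hypothetical.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=200):
    """Approximate the PPR vector for `seed` by power iteration.

    adj maps each node to a list of out-neighbours; every node is assumed
    to have at least one. At every step the walker teleports back to the
    seed with probability alpha, otherwise it moves to a uniformly chosen
    out-neighbour.
    """
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += alpha
        for v, mass in p.items():
            for w in adj[v]:
                nxt[w] += (1 - alpha) * mass / len(adj[v])
        p = nxt
    return p


# Hypothetical undirected graph: triangles a-b-c and d-e-f joined by c-d.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}
ppr = personalized_pagerank(adj, "a")

# Assumed degree adjustment: divide each score by the node's degree.
adjusted = {v: ppr[v] / len(adj[v]) for v in adj}
ranked = sorted(adj, key=lambda v: adjusted[v], reverse=True)
```

On this toy graph the unadjusted scores rank the degree-3 node `c` above `b`, while the degree-adjusted scores rank `b` above `c`, which illustrates the kind of high-degree bias the abstract describes.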

Sparse Popularity Adjusted Stochastic Block Model

Noroozi, Majid, Rimal, Ramchandra, Pensky, Marianna

arXiv.org Machine Learning, Oct-3-2019

The objective of the present paper is to study the Popularity Adjusted Block Model (PABM) in the sparse setting. Unlike other block models, the PABM is flexible enough to allow some of the connection probabilities to be set to zero while keeping the rest non-negligible, leading to the Sparse Popularity Adjusted Block Model (SPABM). The latter reduces the size of the parameter set and leads to improved precision of estimation and clustering. The theory is complemented by a simulation study and real data examples.

matrix, node, probability, (16 more...)

arXiv:1910.01931

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Communications > Social Media (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Data Science > Data Mining (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

CS Sparse K-means: An Algorithm for Cluster-Specific Feature Selection in High-Dimensional Clustering

Zeng, Xiangrui, Zheng, Hongyu

arXiv.org Machine Learning, Oct-3-2019

Feature selection is an important and challenging task in high-dimensional clustering. For example, in genomics, there may only be a small number of genes that are differentially expressed and therefore informative about the overall clustering structure. Existing feature selection methods, such as Sparse K-means, rarely account for features that can only separate a subset of clusters. In genomics, it is highly likely that a gene can only define one subtype against all the other subtypes or distinguish a pair of subtypes but not others. In this paper, we propose a K-means based clustering algorithm that discovers informative features as well as which cluster pairs are separable by each selected feature. The method is essentially an EM algorithm, in which we introduce lasso-type constraints on each cluster pair in the M step, and make the E step possible by maximizing the raw cross-cluster distance instead of minimizing the intra-cluster distance. The results were demonstrated on simulated data and a leukemia gene expression dataset.

algorithm, k-means, sparse 3-means, (15 more...)

arXiv:1909.12384

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology > Leukemia (0.49)
Health & Medicine > Therapeutic Area > Hematology (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Learn Types of Machine Learning Algorithms with Ultimate Use Cases - DataFlair

#artificialintelligence, Oct-1-2019, 11:48:19 GMT

In this article, we will study the various types of machine learning algorithms and their use cases. We will study how Baidu is using supervised learning-based facial recognition for intelligent airport check-in and how Google is making use of reinforcement learning to develop an intelligent platform that would answer your queries. Machine learning is a broad field, but it is commonly divided into three classes: supervised, unsupervised, and reinforcement learning. All three paradigms are widely used to power intelligent applications. We will look at the important use cases of these paradigms and how they are revolutionizing our world today.

algorithm, learning, reinforcement learning, (13 more...)

Country: Asia > China (0.05)

Industry: Transportation (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)
