 Clustering


How to Train a Machine Learning Model in JASP: Clustering

#artificialintelligence

This is a continuation of our series on machine learning methods that have been implemented in JASP (version 0.11 onwards). In this blog post we train a machine learning model to find clusters within our data set. The goal of a clustering task is to detect structures in the data. To do so, the algorithm needs to (1) identify the number of structures/groups in the data, and (2) figure out how the features are distributed in each group. For instance, clustering can be used to detect subgenres in electronic music, subgroups in a customer database, or to identify areas where there are greater incidences of particular types of crime.


IIT Madras and Queen's University Belfast develop technology to make Artificial Intelligence fairer

#artificialintelligence

Indian Institute of Technology Madras students were part of an international research project led by a Queen's University Belfast researcher in the U.K., who has developed an innovative new algorithm to make Artificial Intelligence (AI) fairer and less biased when processing data. Dr. Deepak Padmanabhan, Researcher at Queen's University Belfast and Adjunct Faculty Member at IIT Madras, has been leading an international project, working with Ms. Savitha Abraham and Ms. Sowmya Sundaram, PhD students in the Department of Computer Science and Engineering, IIT Madras, to tackle the discrimination problem within clustering algorithms. Companies often use AI technologies to sift through huge amounts of data in situations such as an oversubscribed job vacancy, or in policing when there is a large volume of CCTV data linked to a crime. However, while AI can save time, the process is often biased in terms of race, gender, age, religion and country of origin. Dr. Padmanabhan said that AI techniques for exploratory data analysis, known as 'clustering algorithms', are often criticised as being biased in terms of 'sensitive attributes' such as race, gender, age, religion and country of origin.


Who Is Your Golden Goose? Learn With Cohort Analysis

#artificialintelligence

Customer segmentation is the technique of dividing customers into groups based on their purchase patterns in order to identify the most profitable groups. Depending on the market, customers can also be segmented by other criteria, such as geographic or demographic characteristics or behavior. The technique assumes that groups with different features require different marketing approaches, and it aims to find the groups that can boost profitability the most. Today, we are going to discuss how to do customer segmentation analysis with the Online Retail dataset from the UCI ML repository. The analysis focuses on two steps: computing the RFM values and forming clusters with the K-means algorithm.
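As a rough illustration of the first step, the sketch below computes RFM (recency, frequency, monetary) values with pandas. The column names follow the UCI Online Retail dataset, but the six rows are invented toy data, and the choice of snapshot date (one day after the last invoice) is an assumption rather than the article's method.

```python
import pandas as pd

# Toy transactions (made-up values); column names mirror the UCI Online Retail data.
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C3"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-03-01", "2011-02-10",
        "2011-01-20", "2011-02-20", "2011-03-09",
    ]),
    "Quantity": [2, 1, 5, 1, 3, 2],
    "UnitPrice": [10.0, 20.0, 4.0, 50.0, 10.0, 5.0],
})

# Money spent per transaction line.
orders["Amount"] = orders["Quantity"] * orders["UnitPrice"]

# Assumed reference date: the day after the most recent invoice.
snapshot = orders["InvoiceDate"].max() + pd.Timedelta(days=1)

# One row per customer: last purchase date, number of invoices, total spend.
rfm = orders.groupby("CustomerID").agg(
    LastPurchase=("InvoiceDate", "max"),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("Amount", "sum"),
)

# Recency = days between the snapshot date and the customer's last purchase.
rfm["Recency"] = (snapshot - rfm["LastPurchase"]).dt.days
rfm = rfm[["Recency", "Frequency", "Monetary"]]

print(rfm)
```

The resulting table has one row per customer and would typically be scaled before being passed to a clustering algorithm such as K-means, since the three columns are on very different scales.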


Provable Noisy Sparse Subspace Clustering using Greedy Neighbor Selection: A Coherence-Based Perspective

arXiv.org Machine Learning

Sparse subspace clustering (SSC) using greedy-based neighbor selection, such as matching pursuit (MP) and orthogonal matching pursuit (OMP), has been known as a popular computationally-efficient alternative to the conventional L1-minimization based methods. Under deterministic bounded noise corruption, in this paper we derive coherence-based sufficient conditions guaranteeing correct neighbor identification using MP/OMP. Our analyses exploit the maximum/minimum inner product between two noisy data points subject to a known upper bound on the noise level. The obtained sufficient condition clearly reveals the impact of noise on greedy-based neighbor recovery. Specifically, it asserts that, as long as noise is sufficiently small so that the resultant perturbed residual vectors stay close to the desired subspace, both MP and OMP succeed in returning a correct neighbor subset. A striking finding is that, when the ground truth subspaces are well-separated from each other and noise is not large, MP-based iterations, while enjoying lower algorithmic complexity, yield smaller perturbation of residuals, thereby better able to identify correct neighbors and, in turn, achieving higher global data clustering accuracy. Extensive numerical experiments are used to corroborate our theoretical study.


Classifying Data Using Artificial Intelligence K-Means Clustering Algorithm

#artificialintelligence

"The primary aim of clustering is not just to make clusters, but to make good and meaningful ones" – Analytics Vidhya (https://www.analyticsvidhya.com/). Optimal multi-objective clustering is one of the most popular and, at the same time, most curious unsupervised machine learning problems. It occurs in many fields of computer science, such as data and knowledge mining, data compression, vector quantization, pattern detection and classification, Voronoi diagrams, recommender engines (RE), etc. Cluster analysis (CA) reveals trends and insights in the input dataset: it determines the similarities and differences between data items and partitions the data in such a way that similar items belong to the same group, or cluster. For example, we can cluster credit card customer data to decide which special offers should be given to a specific customer, based on the balance and loan amount criteria. In this case, all we have to do is partition the customer data into a number of clusters and then give the same offer to similar customers. This is typically done by performing a clustering analysis of multivariate numerical data. The main goal of the clustering itself is to arrange a set of data items, each with an associated numeric n-dimensional feature vector, into a number of homogeneous groups called "clusters".


Blocked Clusterwise Regression

arXiv.org Machine Learning

Such models have been shown to allow estimation and inference by regression clustering methods. This paper is motivated by the finding that the clustered heterogeneity models studied in this literature can be badly misspecified, even when the panel has significant discrete cross-sectional structure. To address this issue, we generalize previous approaches to discrete unobserved heterogeneity by allowing each unit to have multiple, imperfectly-correlated latent variables that describe its response-type to different covariates. We give inference results for a k-means style estimator of our model and develop information criteria to jointly select the number of clusters for each latent variable. Monte Carlo simulations confirm our theoretical results and give intuition about the finite-sample performance of estimation and model selection. We also contribute to the theory of clustering with an over-specified number of clusters and derive new convergence rates for this setting. Our results suggest that over-fitting can be severe in k-means style estimators when the number of clusters is over-specified.


Python: Implementing a k-means algorithm with sklearn

#artificialintelligence

Originally posted by Michael Grogan. This is an example of how sklearn in Python can be used to develop a k-means clustering algorithm. The purpose of k-means clustering is to partition the observations in a dataset into a specific number of clusters in order to aid analysis of the data. In this respect, it is particularly valuable for data visualisation. The example used here is that of stock returns.
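A minimal sketch of what such an sklearn workflow might look like is given below. The six rows are invented (mean return, volatility) pairs standing in for real stock data, and the choice of two clusters is an assumption for illustration, not a value taken from the original post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features per stock: [average return, volatility] (made-up values).
X = np.array([
    [0.020, 0.10], [0.030, 0.12], [0.025, 0.11],   # low-return, low-volatility
    [0.100, 0.40], [0.120, 0.45], [0.110, 0.42],   # high-return, high-volatility
])

# Fit k-means with an assumed k=2; random_state fixes the initialisation.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = model.labels_              # cluster index assigned to each stock
centers = model.cluster_centers_    # one centroid per cluster

print(labels)
print(centers)
```

The integer values of the cluster labels are arbitrary; only the grouping they induce is meaningful. For plotting, `labels` can be passed as the colour argument of a scatter plot of the two features.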


Survey of Network Intrusion Detection Methods from the Perspective of the Knowledge Discovery in Databases Process

arXiv.org Artificial Intelligence

The identification of cyberattacks which target information and communication systems has been a focus of the research community for years. Network intrusion detection is a complex problem which presents a diverse range of challenges. Many attacks currently remain undetected, while newer ones emerge due to the proliferation of connected devices and the evolution of communication technology. In this survey, we review the methods that have been applied to network data with the purpose of developing an intrusion detector, but contrary to previous reviews in the area, we analyze them from the perspective of the Knowledge Discovery in Databases (KDD) process. As such, we discuss the techniques used for the capture, preparation and transformation of the data, as well as the data mining and evaluation methods. In addition, we present the characteristics and motivations behind the use of each of these techniques and propose more adequate and up-to-date taxonomies and definitions for intrusion detectors based on the terminology used in the area of data mining and KDD. Special importance is given to the evaluation procedures followed to assess the different detectors, discussing their applicability in current real networks. Finally, as a result of this literature review, we investigate some open issues which will need to be considered for further research in the area of network security.


Comprehensive Analysis of Time Series Forecasting Using Neural Networks

arXiv.org Machine Learning

Time series forecasting has gained lots of attention recently; this is because many real-world phenomena can be modeled as time series. The massive volume of data and recent advancements in the processing power of the computers enable researchers to develop more sophisticated machine learning algorithms such as neural networks to forecast the time series data. In this paper, we propose various neural network architectures to forecast the time series data using the dynamic measurements; moreover, we introduce various architectures on how to combine static and dynamic measurements for forecasting. We also investigate the importance of performing techniques such as anomaly detection and clustering on forecasting accuracy. Our results indicate that clustering can improve the overall prediction time as well as improve the forecasting performance of the neural network. Furthermore, we show that feature-based clustering can outperform the distance-based clustering in terms of speed and efficiency. Finally, our results indicate that adding more predictors to forecast the target variable will not necessarily improve the forecasting accuracy.


Understanding K-Means Clustering using Python the easy way

#artificialintelligence

In the previous article, we studied k-NN. I believe that if we can relate a concept to ourselves or our lives, there is a greater chance of understanding it, so I will try to explain everything by relating it to humans. K-means tries to make the data points within a cluster (intra-cluster) as similar as possible while keeping the clusters as different, or as far apart, as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster's centroid is at a minimum.
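The sum-of-squared-distances objective described above can be made concrete with a small NumPy sketch. It performs one assignment step and one centroid update on four invented points; the function names and starting centroids are hypothetical, chosen only for illustration.

```python
import numpy as np

def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    # Squared Euclidean distance from every point to every centroid: shape (n, k).
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def within_cluster_ss(points, centroids, labels):
    """Sum of squared distances between points and their assigned centroid."""
    return float(((points - centroids[labels]) ** 2).sum())

def update_centroids(points, centroids, labels):
    """Move each centroid to the mean of its points (unchanged if it has none)."""
    return np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])

# Four toy points forming two obvious groups, with assumed starting centroids.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

labels = assign_clusters(points, centroids)
before = within_cluster_ss(points, centroids, labels)

centroids = update_centroids(points, centroids, labels)
after = within_cluster_ss(points, centroids, labels)

print(labels, before, after)
```

Moving each centroid to the mean of its assigned points never increases the objective, which is why repeating the two steps converges to a local minimum.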