 Clustering


Spectral Clustering – How Math is Redefining Decision Making

@machinelearnbot

In today's world of big data and the internet of things, it is common for a business to find itself sitting atop a mountain of data. Possessing it is one thing, but leveraging it for data-driven decision making is a very different ball game. Gut feelings and institutionalized heuristics have traditionally guided the development of protocol and decision making, but the world of artificial intelligence and big, disparate data is changing that. Everyone is trying to make sense of, and extract value from, their data. Those that are not will be left behind. This challenge (and opportunity) is not limited to certain industries.


Fast clustering algorithms for massive datasets

#artificialintelligence

You gather tons of keywords over the Internet with a web crawler (crawling Wikipedia or DMOZ directories), and compute the frequencies for each keyword and for each "keyword pair". A "keyword pair" is two keywords found on the same web page, or close to each other on the same web page. By keyword, I mean a phrase such as "California insurance", so a keyword usually contains more than one token, but rarely more than three. With all the frequencies, you can create a table (typically containing many millions of keywords, even after keyword cleaning), where each entry is a pair of keywords and 3 numbers, e.g.


Kernelized Weighted SUSAN based Fuzzy C-Means Clustering for Noisy Image Segmentation

arXiv.org Machine Learning

The paper proposes a novel Kernelized image segmentation scheme for noisy images that utilizes the concept of Smallest Univalue Segment Assimilating Nucleus (SUSAN) and incorporates spatial constraints by computing circular colour map induced weights. Fuzzy damping coefficients are obtained for each nucleus or center pixel on the basis of the corresponding weighted SUSAN area values, the weights being equal to the inverse of the number of horizontal and vertical moves required to reach a neighborhood pixel from the center pixel. These weights are used to vary the contributions of the different nuclei in the Kernel based framework. The paper also presents an edge quality metric obtained by fuzzy decision based edge candidate selection and final computation of the blurriness of the edges after their selection. The inability of existing algorithms to preserve edge information and structural details in their segmented maps necessitates the computation of the edge quality factor (EQF) for all the competing algorithms. Qualitative and quantitative analysis have been rendered with respect to state-of-the-art algorithms and for images ridden with varying types of noises. Speckle noise ridden SAR images and Rician noise ridden Magnetic Resonance Images have also been considered for evaluating the effectiveness of the proposed algorithm in extracting important segmentation information. Image segmentation [1] constitutes an important part of image processing which has various applications in the fields of feature extraction and object recognition. The goal of image segmentation methods is to cluster the pixels of an image into salient regions and hence these methods mainly involve various clustering techniques [2-6].


Clustering Similar Images Using MapReduce Style Feature Extraction with C# and R

@machinelearnbot

The createPairwiseMatches() function shown in Figure 7 above extracts features in parallel, mapping images to vertical and horizontal luminosity histograms. The histograms for each image are saved in a hash table for quick reference, since each image's features will be repeatedly matched to other images. Once the match features are extracted, the match is immediately placed in a thread-safe blocking collection for further downstream reduction processing. While the mapping functions shown in Figure 7 are executing in a background thread, parallel reduce functions execute simultaneously, processing each completed match to calculate the similarity between the matched images.


K-Means Clustering - Lazy Programmer

#artificialintelligence

K-means clustering is one of the simplest clustering algorithms one can use to find natural groupings of an unlabeled data set. Another way of stating this is that k-means clustering is an unsupervised learning algorithm: it is "learning the structure of X without being given Y". K-means clustering finds "k" different means (surprise surprise) which represent the centers of k clusters and assigns each data point to one of these clusters. The cluster a point is assigned to is the one where the distance (usually Euclidean) from the point to the mean is smallest.
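The assign-then-update loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not the article's own code: the function name kmeans, the choice of the first k points as initial means, and the sample data are all assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=100):
    """Minimal k-means sketch on a list of equal-length numeric tuples."""
    # Assumption: use the first k points as the initial means. Real
    # implementations usually pick a random or smarter initialisation.
    means = list(points[:k])
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point goes to the mean with the smallest
        # Euclidean distance.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, means[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the means are stable
        assignments = new_assignments
        # Update step: each mean becomes the average of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old mean if a cluster ends up empty
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, assignments


# Two visibly separated groups of 2-D points (made-up data).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
means, labels = kmeans(points, k=2)
print(labels)
```

Even with a poor starting choice (both initial means come from the same group here), the loop settles on one mean per group after a few passes. In general, though, k-means can stop at a local optimum, so the result depends on the initialisation.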


Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance

arXiv.org Machine Learning

The present work proposes hybridization of Expectation-Maximization (EM) and K-Means techniques as an attempt to speed up the clustering process. Though K-Means and EM techniques look into different areas, K-means can be viewed as an approximate way to obtain maximum likelihood estimates for the means. Along with the proposed algorithm for hybridization, the present work also experiments with the Standard EM algorithm. Six different datasets are used for the experiments, of which three are synthetic datasets. Clustering fitness and Sum of Squared Errors (SSE) are computed for measuring the clustering performance. In all the experiments it is observed that the proposed algorithm for hybridization of EM and K-Means techniques consistently takes less execution time, with acceptable Clustering Fitness value and less SSE, than the standard EM algorithm. It is also observed that the proposed algorithm produces better clustering results than the Cluster package of Purdue University.


Semantic Properties of Customer Sentiment in Tweets

arXiv.org Machine Learning

An increasing number of people are using online social networking services (SNSs), and a significant amount of information related to experiences in consumption is shared in this new media form. Text mining is an emerging technique for mining useful information from the web. We aim to discover semantic patterns in consumers' discussions on social media, in tweets in particular. Specifically, the purposes of this study are twofold: 1) finding similarity and dissimilarity between two sets of textual documents that include consumers' sentiment polarities, two forms of positive vs. negative opinions, and 2) deriving actual content from the textual data that has a semantic trend. The considered tweets include consumers' opinions on US retail companies (e.g., Amazon, Walmart). Cosine similarity and K-means clustering methods are used to achieve the former goal, and Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, is used for the latter purpose. This is the first study to discover semantic properties of textual data in a consumption context beyond sentiment analysis. In addition to the major findings, we apply LDA to the same data and draw latent topics that represent consumers' positive opinions and negative opinions on social media.
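As a rough illustration of the cosine similarity measure mentioned in this abstract, the sketch below compares two short texts by their word counts. It is not the paper's pipeline: the whitespace tokenisation, the function name cosine_similarity, and the sample sentences are assumptions for the example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    # Assumption: lowercase, whitespace-split tokens stand in for a real
    # preprocessing step (stop-word removal, stemming, TF-IDF weighting).
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up example sentences, not data from the study.
same = cosine_similarity("fast delivery great price", "great price fast delivery")
disjoint = cosine_similarity("fast delivery", "rude staff")
partial = cosine_similarity("fast delivery great price", "slow delivery bad price")
print(same, disjoint, partial)
```

Word order does not matter for this measure, so the first pair scores as identical, while texts with no shared words score zero.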


Clustering Time-Series Energy Data from Smart Meters

arXiv.org Machine Learning

Investigations have been performed into using clustering methods in data mining time-series data from smart meters. The problem is to identify patterns and trends in energy usage profiles of commercial and industrial customers over 24-hour periods, and group similar profiles. We tested our method on energy usage data provided by several U.S. power utilities. The results show accurate grouping of accounts similar in their energy usage patterns, and potential for the method to be utilized in energy efficiency programs.


Ward's Method for clustering in SAS

@machinelearnbot

Ward's method looks at cluster analysis as an analysis of variance problem. This method involves an agglomerative clustering algorithm. It starts out with n clusters of size 1 and continues until all the observations are included in one cluster. This method is most appropriate for quantitative variables, and not binary variables. Then you can set some threshold for the outlier clusters, for example that the size of the cluster is smaller than n*0.1%.
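The agglomerative procedure can be sketched outside SAS as well. The Python below is a naive illustration, not the SAS procedure from the article. It starts with every observation in its own cluster and repeatedly merges the pair whose merge adds the least within-cluster sum of squares, which is the usual statement of Ward's criterion. The function names, the early stop at n_clusters (instead of continuing to a single cluster), and the sample data are assumptions for the example.

```python
def centroid(cluster):
    """Coordinate-wise mean of a list of numeric tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def merge_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    squared_distance = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_distance


def ward_clustering(points, n_clusters):
    # Start with n clusters of size 1.
    clusters = [[p] for p in points]
    # Assumption: stop once n_clusters remain rather than merging down to one.
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the cheapest pair
        del clusters[j]
    return clusters


# Made-up data: two tight groups and one isolated observation.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
clusters = ward_clustering(points, 3)
sizes = sorted(len(c) for c in clusters)
print(sizes)
```

The isolated observation ends up in a cluster of size 1, which is the kind of small cluster a size threshold like the one described above would flag as an outlier. This version rescans all pairs on every merge, so it is only practical for small data sets.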


Predicting Glaucoma Visual Field Loss by Hierarchically Aggregating Clustering-based Predictors

arXiv.org Machine Learning

This study addresses the issue of predicting glaucomatous visual field loss from patient disease datasets. Our goal is to accurately predict the progress of the disease in individual patients. As very few measurements are available for each patient, it is difficult to produce good predictors for individuals. A recently proposed clustering-based method enhances the power of prediction using patient data with similar spatiotemporal patterns. Each patient is categorized into a cluster of patients, and a predictive model is constructed using all of the data in the class. Predictions are highly dependent on the quality of clustering, but it is difficult to identify the best clustering method. Thus, we propose a method for aggregating cluster-based predictors to obtain better prediction accuracy than from a single cluster-based prediction. Further, the method shows very high performance by hierarchically aggregating experts generated from several cluster-based methods. We use real datasets to demonstrate that our method performs significantly better than conventional clustering-based and patient-wise regression methods, because the hierarchical aggregating strategy has a mechanism whereby good predictors in a small community can thrive.