 Clustering


The palettes of Earth

#artificialintelligence

Take a satellite image and map its pixels into a uniform 3-D color space. Then run a clustering algorithm on those pixels to extract a number of clusters. The centroids of those clusters then make a representative palette of the image. The R package earthtones by Will Cornwell, Mitch Lyons, and Nick Murray (now available on CRAN) does all this for you. Pass the get_earthtones function a latitude and longitude, and it will grab the Google Earth tile at the requested zoom level (8 works well for cities) and generate a palette with the desired number of colors.
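The pipeline described above (pixels in a color space, clustering, centroids as palette) can be sketched in a few lines. This is a minimal illustration and not the earthtones implementation: it assumes plain k-means on raw RGB tuples rather than a perceptually uniform color space, and the names kmeans_palette and to_hex, as well as the toy pixel list, are made up for the example.

```python
import random

def kmeans_palette(pixels, k, iterations=20, seed=0):
    """Cluster RGB pixels with plain k-means and return the k centroids."""
    rng = random.Random(seed)
    # Start from k distinct pixels chosen at random (a simple initialisation).
    centroids = rng.sample(sorted(set(pixels)), k)
    for _ in range(iterations):
        # Assignment step: group each pixel with its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in pixels:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        new_centroids = []
        for i, group in enumerate(groups):
            if group:
                new_centroids.append(tuple(sum(c) / len(group) for c in zip(*group)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids

def to_hex(color):
    """Format an (r, g, b) triple as a hex color string."""
    return "#{:02x}{:02x}{:02x}".format(*(int(round(c)) for c in color))

# Toy "image": three dark-green pixels and three sandy pixels (hypothetical data).
pixels = [
    (30, 90, 40), (34, 94, 44), (32, 92, 42),
    (210, 180, 140), (214, 184, 144), (212, 182, 142),
]
palette = sorted(to_hex(c) for c in kmeans_palette(pixels, k=2))
print(palette)
```

On a real tile the pixel list would come from an image reader, and converting to a perceptually uniform space before clustering would likely give palettes closer to what the eye expects.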


Clustering Made Simple with Spotfire

#artificialintelligence

Data clustering is the process of grouping items together based on similarities between the items of a group. Clustering can be used for data compression, data mining, pattern recognition, and machine learning. Examples of applications include clustering consumers into market segments, classifying manufactured units by their failure signatures, identifying crime hot spots, and identifying regions with similar geographical characteristics. Once clusters are defined, the next step may be to build a predictive model. TIBCO Spotfire makes it easy to perform clustering with two popular, user-friendly, out-of-the-box methods: (1) k-means clustering and (2) hierarchical clustering. The k-means method is a popular and simple approach to clustering, and Spotfire line charts help visualize the data before performing calculations.


Statistical Properties of the Single Linkage Hierarchical Clustering Estimator

arXiv.org Machine Learning

Distance-based hierarchical clustering (HC) methods are widely used in unsupervised data analysis but few authors take account of uncertainty in the distance data. We incorporate a statistical model of the uncertainty through corruption or noise in the pairwise distances and investigate the problem of estimating the HC as unknown parameters from measurements. Specifically, we focus on single linkage hierarchical clustering (SLHC) and study its geometry. We prove that under fairly reasonable conditions on the probability distribution governing measurements, SLHC is equivalent to maximum partial profile likelihood estimation (MPPLE) with some of the information contained in the data ignored. At the same time, we show that direct evaluation of SLHC on maximum likelihood estimation (MLE) of pairwise distances yields a consistent estimator. Consequently, a full MLE is expected to perform better than SLHC in getting the correct HC results for the ground truth metric.
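For readers unfamiliar with SLHC itself, the following is a small sketch of single linkage run directly on a pairwise distance matrix. It is not the estimator studied in the paper: it ignores measurement noise entirely and simply uses the well-known equivalence between single linkage and Kruskal-style merging of the closest components. The function name and the toy matrix are assumptions made for illustration.

```python
def single_linkage(dist):
    """Return merges [(height, members_a, members_b), ...] for a symmetric distance matrix."""
    n = len(dist)
    parent = list(range(n))
    members = {i: [i] for i in range(n)}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visit pairs from closest to farthest; each pair that joins two different
    # components is a single-linkage merge at that distance.
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    merges = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            merges.append((d, sorted(members[ri]), sorted(members[rj])))
            parent[rj] = ri
            members[ri] = members[ri] + members.pop(rj)
    return merges

# Hypothetical distances between four items.
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
merges = single_linkage(dist)
for height, a, b in merges:
    print(height, a, b)
```

Because each merge height is a single entry of the matrix, perturbing one distance can change the whole tree, which gives some intuition for why noise in the distances matters.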


Machine learning: Clustering and classification on the campaign trail

#artificialintelligence

As the election season rampages on, we categorize voters into broad demographics -- soccer moms, NASCAR dads, blacks, whites, ALICEs, yuppies -- in an attempt to understand and discuss this complex, churning electorate. In doing so we're tapping into something fundamental about how we perceive the world: not as a sequence of singular individuals, but rather as a massive set of overlapping taxonomies that, taken together, comprise an impressively structured human experience. With fewer than 20 yes/no queries on category membership we can often identify a single object amidst a staggering breadth of possibilities. We've grouped everything that we know to exist and the groupings themselves are the primary subject of our thoughts. We can go the other direction as well -- taking an object and placing it in its many groups.


Discovering Patterns in Time-Varying Graphs: A Triclustering Approach

arXiv.org Machine Learning

This paper introduces a novel technique to track structures in time varying graphs. The method uses a maximum a posteriori approach for adjusting a three-dimensional co-clustering of the source vertices, the destination vertices and the time, to the data under study, in a way that does not require any hyper-parameter tuning. The three dimensions are simultaneously segmented in order to build clusters of source vertices, destination vertices and time segments where the edge distributions across clusters of vertices follow the same evolution over the time segments. The main novelty of this approach lies in that the time segments are directly inferred from the evolution of the edge distribution between the vertices, thus not requiring the user to make any a priori quantization. Experiments conducted on artificial data illustrate the good behavior of the technique, and a study of a real-life data set shows the potential of the proposed approach for exploratory data analysis.


R Addict Blog

#artificialintelligence

Machine and statistical learning wizards are increasingly eager to run their analyses with the Spark ML library whenever possible. It's trendy, posh, spicy and gives the feeling of doing state-of-the-art machine learning and being up to date with the newest computational trends. It is even more sexy and powerful when computations can be performed on an extraordinarily enormous computation cluster - let's say 100 machines on a YARN Hadoop cluster makes you the real data cruncher! In this post I present the sparklyr package (by RStudio), the connector that will transform you from a regular R user, to the supa! Moreover, I present how I have extended the interface to the K-means procedure, so that it is now also possible to compute the cost for that model, which might be beneficial in determining the number of clusters in segmentation problems.
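To make the "cost" concrete: for K-means it is usually taken to be the sum of squared distances from each point to its nearest cluster center, and plotting it against the number of clusters is a common way to pick k. The sketch below is plain Python, not sparklyr or Spark code; kmeans_cost, the toy points, and the hand-picked centers are all assumptions for illustration.

```python
def kmeans_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )

# Two well-separated pairs of points (hypothetical data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Centers chosen by hand as the cluster means for k = 1 and k = 2.
cost_k1 = kmeans_cost(points, [(5.0, 1.0)])
cost_k2 = kmeans_cost(points, [(0.0, 1.0), (10.0, 1.0)])
print(cost_k1, cost_k2)
```

The sharp drop in cost from one to two clusters, followed by much smaller gains for larger k, is the "elbow" people look for in segmentation problems.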


Clustering and Community Detection with Imbalanced Clusters

arXiv.org Machine Learning

Spectral clustering methods, which are frequently used in clustering and community detection applications, are sensitive to the specific graph construction, particularly when imbalanced clusters are present. We show that ratio cut (RCut) or normalized cut (NCut) objectives are not tailored to imbalanced cluster sizes since they tend to emphasize cut sizes over cut values. We propose a graph partitioning problem that seeks minimum cut partitions under minimum size constraints on partitions to deal with imbalanced cluster sizes. Our approach parameterizes a family of graphs by adaptively modulating node degrees on a fixed node set, yielding a set of parameter dependent cuts reflecting varying levels of imbalance. The solution to our problem is then obtained by optimizing over these parameters. We present rigorous limit cut analysis results to justify our approach and demonstrate the superiority of our method through experiments on synthetic and real datasets for data clustering, semi-supervised learning and community detection.
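As a reminder of what the two objectives measure, here is a small sketch that evaluates them for a two-way partition. It assumes the common textbook convention RCut = cut(A, B) (1/|A| + 1/|B|) and NCut = cut(A, B) (1/vol(A) + 1/vol(B)), where vol is the sum of node degrees; the paper may use a different normalization. The function name and the toy graph are made up, and the code does not implement the method proposed in the paper.

```python
def cut_objectives(weights, part_a):
    """Return (cut, RCut, NCut) for the two-way partition (part_a, complement).

    `weights` maps undirected edges (u, v) to non-negative weights.
    """
    nodes = set()
    for u, v in weights:
        nodes.update((u, v))
    a = set(part_a)
    b = nodes - a

    # Cut value: total weight of edges with exactly one endpoint in A.
    cut = sum(w for (u, v), w in weights.items() if (u in a) != (v in a))

    # Weighted degree of every node, used for the volumes in NCut.
    degree = {n: 0.0 for n in nodes}
    for (u, v), w in weights.items():
        degree[u] += w
        degree[v] += w
    vol_a = sum(degree[n] for n in a)
    vol_b = sum(degree[n] for n in b)

    rcut = cut * (1 / len(a) + 1 / len(b))
    ncut = cut * (1 / vol_a + 1 / vol_b)
    return cut, rcut, ncut

# A triangle {0, 1, 2} with a single pendant node 3: an imbalanced split.
weights = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0, (2, 3): 1.0}
cut, rcut, ncut = cut_objectives(weights, part_a={0, 1, 2})
print(cut, rcut, ncut)
```

Both objectives divide the same cut value by a size term, so a small part inflates the score regardless of how cheap the cut is.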


Community Detection and Classification in Hierarchical Stochastic Blockmodels

arXiv.org Machine Learning

We propose a robust, scalable, integrated methodology for community detection and community comparison in graphs. In our procedure, we first embed a graph into an appropriate Euclidean space to obtain a low-dimensional representation, and then cluster the vertices into communities. We next employ nonparametric graph inference techniques to identify structural similarity among these communities. These two steps are then applied recursively on the communities, allowing us to detect more fine-grained structure. We describe a hierarchical stochastic blockmodel---namely, a stochastic blockmodel with a natural hierarchical structure---and establish conditions under which our algorithm yields consistent estimates of model parameters and motifs, which we define to be stochastically similar groups of subgraphs. Finally, we demonstrate the effectiveness of our algorithm in both simulated and real data. Specifically, we address the problem of locating similar subcommunities in a partially reconstructed Drosophila connectome and in the social network Friendster.


Incremental Minimax Optimization based Fuzzy Clustering for Large Multi-view Data

arXiv.org Machine Learning

Incremental clustering approaches have been proposed for handling large data when a given data set is too large to be stored. The key idea of these approaches is to find representatives to represent each cluster in each data chunk, and the final data analysis is carried out based on those identified representatives from all the chunks. However, most of the incremental approaches are used for single-view data. As large multi-view data generated from multiple sources becomes prevalent nowadays, there is a need for incremental clustering approaches to handle both large and multi-view data. In this paper we propose a new incremental clustering approach called incremental minimax optimization based fuzzy clustering (IminimaxFCM) to handle large multi-view data. In IminimaxFCM, representatives with multiple views are identified to represent each cluster by integrating multiple complementary views using minimax optimization. The detailed problem formulation, updating rules derivation, and the in-depth analysis of the proposed IminimaxFCM are provided. Experimental studies on several real world multi-view data sets have been conducted. We observed that IminimaxFCM outperforms related incremental fuzzy clustering in terms of clustering accuracy, demonstrating the great potential of IminimaxFCM for large multi-view data analysis.


Multi-View Fuzzy Clustering with Minimax Optimization for Effective Clustering of Data from Multiple Sources

arXiv.org Machine Learning

Multi-view data clustering refers to categorizing a data set by making good use of related information from multiple representations of the data. It has become important because more and more data can be collected in a variety of ways, in different settings and from different sources, so each data set can be represented by different sets of features to form different views of it. Many approaches have been proposed to improve clustering performance by exploring and integrating heterogeneous information underlying different views. In this paper, we propose a new multi-view fuzzy clustering approach called MinimaxFCM by using minimax optimization based on the well-known fuzzy c-means. In MinimaxFCM the consensus clustering results are generated based on minimax optimization in which the maximum disagreements of different weighted views are minimized. Moreover, the weight of each view can be learned automatically in the clustering process. In addition, there is only one parameter to be set besides the fuzzifier. The detailed problem formulation, updating rules derivation, and the in-depth analysis of the proposed MinimaxFCM are provided here. Experimental studies on nine multi-view data sets including real world image and document data sets have been conducted. We observed that MinimaxFCM outperforms related multi-view clustering approaches in terms of clustering accuracy, demonstrating the great potential of MinimaxFCM for multi-view data analysis.
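For context, the single-view fuzzy c-means that MinimaxFCM builds on alternates two updates: soft memberships computed from distances to the centers, and centers recomputed as membership-weighted means, with the fuzzifier m controlling how soft the memberships are. The sketch below shows only that standard baseline on 1-D data. It does not include the multi-view weighting or the minimax step from the paper, and the function name, data, and starting centers are assumptions.

```python
def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Plain single-view fuzzy c-means on 1-D data.

    Returns (centers, memberships); memberships[k][i] is the degree to which
    point k belonged to cluster i when the centers were last updated.
    """
    centers = list(centers)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for x in points:
            d = [abs(x - c) for c in centers]
            if 0.0 in d:
                # A point sitting exactly on a center belongs fully to it.
                hit = d.index(0.0)
                row = [1.0 if i == hit else 0.0 for i in range(len(centers))]
            else:
                # u_i is proportional to d_i ** (-2 / (m - 1)), normalised to sum to 1.
                inv = [dist ** (-2.0 / (m - 1.0)) for dist in d]
                total = sum(inv)
                row = [v / total for v in inv]
            memberships.append(row)
        # Center update: mean of the points weighted by membership ** m.
        centers = [
            sum((row[i] ** m) * x for row, x in zip(memberships, points))
            / sum(row[i] ** m for row in memberships)
            for i in range(len(centers))
        ]
    return centers, memberships

# Two loose groups around 1 and 9 (hypothetical data), with rough starting centers.
points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
centers, memberships = fuzzy_c_means(points, centers=[0.0, 10.0])
print(centers)
```

The fuzzifier m is the parameter the abstract refers to: values close to 1 give nearly hard assignments, while larger values spread membership more evenly across clusters.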