 Clustering


How Co-clustering Can Discover Industrial Patterns – Hacker Noon

#artificialintelligence

Worse yet, this is not a fluke example. For many organizations in the industrial realm, it is still difficult to use large-scale data for knowledge discovery. In recent years, data organization and classification have evolved only modestly, while the volume of acquired data has ballooned, which makes analyzing vast and heterogeneous datasets a challenge. A technique known as data clustering can help, however.


Learning Robust Representations for Computer Vision

arXiv.org Machine Learning

Unsupervised learning techniques in computer vision often require learning latent representations, such as low-dimensional linear and non-linear subspaces. Noise and outliers in the data can frustrate these approaches by obscuring the latent spaces. Our main goal is a deeper understanding and new development of robust approaches for representation learning. We provide a new interpretation for existing robust approaches and present two specific contributions: a new robust PCA approach, which can separate foreground features from dynamic background, and a novel robust spectral clustering method that can cluster facial images with high accuracy. Both contributions show superior performance to standard methods on real-world test sets.


A generalized multivariate Student-t mixture model for Bayesian classification and clustering of radar waveforms

arXiv.org Machine Learning

In this paper, a generalized multivariate Student-t mixture model is developed for classification and clustering of Low Probability of Intercept radar waveforms. A Low Probability of Intercept radar signal is characterized by a pulse compression waveform which is either frequency-modulated or phase-modulated. The proposed model can classify and cluster different modulation types such as linear frequency modulation, nonlinear frequency modulation, polyphase Barker, polyphase P1, P2, P3, P4, Frank and Zadoff codes. The classification method focuses on the introduction of a new prior distribution for the model hyper-parameters, which makes it possible to handle the sensitivity of mixture models to initialization and allows a less restrictive modeling of the data. Inference is performed with a Variational Bayes method, and a Bayesian treatment is adopted for model learning, supervised classification and clustering. Moreover, because the novel prior distribution is not a well-known probability distribution, both deterministic and stochastic methods are employed to estimate its expectations. Numerical experiments show that the proposed method is less sensitive to initialization and provides more accurate results than the previous state-of-the-art mixture models.


Dynamic Clustering Algorithms via Small-Variance Analysis of Markov Chain Mixture Models

arXiv.org Machine Learning

Bayesian nonparametrics are a class of probabilistic models in which the model size is inferred from data. A recently developed methodology in this field is small-variance asymptotic analysis, a mathematical technique for deriving learning algorithms that capture much of the flexibility of Bayesian nonparametric inference algorithms, but are simpler to implement and less computationally expensive. Past work on small-variance analysis of Bayesian nonparametric inference algorithms has exclusively considered batch models trained on a single, static dataset, which are incapable of capturing time evolution in the latent structure of the data. This work presents a small-variance analysis of the maximum a posteriori filtering problem for a temporally varying mixture model with a Markov dependence structure, which captures temporally evolving clusters within a dataset. Two clustering algorithms result from the analysis: D-Means, an iterative clustering algorithm for linearly separable, spherical clusters; and SD-Means, a spectral clustering algorithm derived from a kernelized, relaxed version of the clustering problem. Empirical results from experiments demonstrate the advantages of using D-Means and SD-Means over contemporary clustering algorithms, in terms of both computational cost and clustering accuracy.


How Machines Make Sense of Big Data: an Introduction to Clustering Algorithms

#artificialintelligence

While there's not necessarily a "correct" answer here, it's most likely you split the bugs into four clusters. That wasn't too bad, was it? You could probably do the same with twice as many bugs, right? If you had a bit of time to spare -- or a passion for entomology -- you could probably even do the same with a hundred bugs. For a machine though, grouping ten objects into however many meaningful clusters is no small task, thanks to a mind-bending branch of maths called combinatorics, which tells us that there are 115,975 different possible ways you could have grouped those ten insects together. Had there been twenty bugs, there would have been over fifty trillion possible ways of clustering them. With a hundred bugs, there'd be many times more solutions than there are particles in the known universe. In fact, there are more than four million billion googol solutions (what's a googol?).
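The counts quoted in this excerpt are Bell numbers: the number of ways to partition n labelled items into non-empty groups. As a minimal sketch for checking them, the Python below builds the Bell triangle row by row. The function name `bell_number` is chosen here for illustration and does not come from the article.

```python
def bell_number(n: int) -> int:
    """Count the ways to split n labelled items into non-empty groups."""
    row = [1]  # first row of the Bell triangle, so B(0) = 1
    for _ in range(n):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        # Every further entry is the entry to its left plus the entry above-left.
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    # After n steps the first entry of the current row is B(n).
    return row[0]


print(bell_number(10))  # 115975
print(bell_number(20))  # 51724158235372
print(bell_number(100) > 4 * 10**115)  # True
```

Python integers have arbitrary precision, so the hundred-bug case is computed exactly rather than approximated.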


R Clustering – A Tutorial for Cluster Analysis with R

#artificialintelligence

Clustering is a data segmentation technique that divides huge datasets into different groups on the basis of similarity in the data. It is a statistical operation of grouping objects. The resulting groups are clusters.


Estimating the Number of Clusters via Normalized Cluster Instability

arXiv.org Machine Learning

We improve existing instability-based methods for the selection of the number of clusters $k$ in cluster analysis by normalizing instability. In contrast to existing instability methods which only perform well for bounded sequences of small $k$, our method performs well across the whole sequence of possible $k$. In addition, we compare for the first time model-based and model-free variants of $k$ selection via cluster instability and find that their performance is similar. We make our method available in the R-package \verb+cstab+.


Which Spark machine learning API should you use?

#artificialintelligence

Remember, just because you get the algorithm to run doesn't mean the result isn't nonsense. If you're new to all of this, then the Machine Learning Foundations course on Coursera is a good place to start -- despite the creepy floating half-professor.


Sequential geophysical and flow inversion to characterize fracture networks in subsurface systems

arXiv.org Machine Learning

Subsurface applications including geothermal, geological carbon sequestration, oil and gas, etc., typically involve maximizing either the extraction of energy or the storage of fluids. Characterizing the subsurface is extremely complex due to heterogeneity and anisotropy. Due to this complexity, there are uncertainties in the subsurface parameters, which need to be estimated from multiple diverse as well as fragmented data streams. In this paper, we present a non-intrusive sequential inversion framework for integrating data from geophysical and flow sources to constrain subsurface Discrete Fracture Networks (DFN). In this approach, we first estimate bounds on the statistics for the DFN fracture orientations using microseismic data. These bounds are estimated through a combination of a focal mechanism (physics-based approach) and clustering analysis (statistical approach) of seismic data. Then, the fracture lengths are constrained based on the flow data. The efficacy of this multi-physics based sequential inversion is demonstrated through a representative synthetic example.


Density Level Set Estimation on Manifolds with DBSCAN

arXiv.org Machine Learning

We show that DBSCAN can estimate the connected components of the $\lambda$-density level set $\{ x : f(x) \ge \lambda\}$ given $n$ i.i.d. samples from an unknown density $f$. We characterize the regularity of the level set boundaries using a parameter $\beta > 0$ and analyze the estimation error under the Hausdorff metric. When the data lies in $\mathbb{R}^D$, we obtain a rate of $\widetilde{O}(n^{-1/(2\beta + D)})$, which matches known lower bounds up to logarithmic factors. When the data lies on an embedded unknown $d$-dimensional manifold in $\mathbb{R}^D$, we obtain a rate of $\widetilde{O}(n^{-1/(2\beta + d\cdot \max\{1, \beta \})})$. Finally, we provide adaptive parameter tuning in order to attain these rates with no a priori knowledge of the intrinsic dimension, density, or $\beta$.
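For readers unfamiliar with DBSCAN, the following is a minimal brute-force sketch of the textbook algorithm in Python. It is not the paper's estimator and does not include its adaptive parameter tuning. The names `dbscan`, `eps` and `min_pts` are illustrative choices. Loosely, requiring at least `min_pts` samples within radius `eps` acts as an empirical density threshold, and each returned cluster corresponds to one connected high-density region.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    # Brute-force eps-neighbourhoods (each point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may be relabelled as a border point
            continue
        # Point i is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                frontier.extend(neighbours[j])  # only core points extend the cluster
    return labels


# Two dense 2x2 squares far apart, plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The quadratic neighbourhood search keeps the sketch short. A practical implementation would typically use a spatial index instead.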