Clustering
A Theoretical Analysis of Noisy Sparse Subspace Clustering on Dimensionality-Reduced Data
Wang, Yining, Wang, Yu-Xiang, Singh, Aarti
Subspace clustering is the problem of partitioning unlabeled data points into a number of clusters so that data points within one cluster lie approximately on a low-dimensional linear subspace. In many practical scenarios, the dimensionality of data points to be clustered are compressed due to constraints of measurement, computation or privacy. In this paper, we study the theoretical properties of a popular subspace clustering algorithm named sparse subspace clustering (SSC) and establish formal success conditions of SSC on dimensionality-reduced data. Our analysis applies to the most general fully deterministic model where both underlying subspaces and data points within each subspace are deterministically positioned, and also a wide range of dimensionality reduction techniques (e.g., Gaussian random projection, uniform subsampling, sketching) that fall into a subspace embedding framework (Meng & Mahoney, 2013; Avron et al., 2014). Finally, we apply our analysis to a differentially private SSC algorithm and established both privacy and utility guarantees of the proposed method.
Using Machine Learning to Measure Job Skill Similarities
This project involved implementing machine learning methodologies to identify similarities in job skills contained in resumes. An organization presented the project to the New York City Data Science Academy to explore whether Academy students might be interested in working on it. The three authors of this post, all students at the Academy at the time, agreed to take the project on. In formulating the analysis described in this post, the authors collaborated with several representatives of the organization. While the organization has asked us to refrain from disclosing its name at this time, the authors wish to convey their gratitude to the organization for the opportunity to work on the project as part of our studies at the Academy.
Anomaly Detection in Telecommunications Using Complex Streaming Data Whiteboard Walkthrough
In this week's Whiteboard Walkthrough Ted Dunning, Chief Application Architect at MapR, explains in detail how to use streaming IoT sensor data from handsets and devices as well as cell tower data to detect strange anomalies. He takes us from best practices for data architecture, including the advantages of multi-master writes with MapR Streams, through analysis of the telecom data using clustering methods to discover normal and anomalous behaviors. I'd like to talk a little bit about data processing in the context of telecom. In particular, what I'll be talking about is how to extend some of the ideas that we've talked about in other videos about anomaly detection to the particular case of telecom data. This is very different than some of the other data that we've been talking about.
Hybrid clustering-classification neural network in the medical diagnostics of reactive arthritis
Bodyanskiy, Yevgeniy, Vynokurova, Olena, Savvo, Volodymyr, Tverdokhlib, Tatiana, Mulesa, Pavlo
Self-organizing maps (SOM) and neural networks of learning vector quantization (LVQ) have seen extensive use for solving different problems in Data Mining domain (clustering, classification, fault detection and compression of information etc.). This type of neural networks was proposed by T. Kohonen [1, 2] and represents, in fact, a single-layer feedforward architecture, which provides an operator for mapping of input space into the output space. Operation-wise SOM and LVQ are quite similar to each neuron is fed input signal (sample) producing output, which is used during competition stage to determine winning neuron - usually the one with maximum output signal value. Vector of synaptic weights for winning neuron is the one closest to the input sample in terms of the metric chosen (which is Euclidian metric in most cases). Next is neurons adjustment phase.
Clustering by connection center evolution
The determination of cluster centers generally depends on the scale that we use to analyze the data to be clustered. Inappropriate scale usually leads to unreasonable cluster centers and thus unreasonable results. In this study, we first consider the similarity of elements in the data as the connectivity of nodes in an undirected graph, then present the concept of a connection center and regard it as the cluster center of the data. Based on this definition, the determination of cluster centers and the assignment of class are very simple, natural and effective. One more crucial finding is that the cluster centers of different scales can be obtained easily by the different powers of a similarity matrix and the change of power from small to large leads to the dynamic evolution of cluster centers from local (microscopic) to global (microscopic). Further, in this process of evolution, the number of categories changes discontinuously, which means that the presented method can automatically skip the unreasonable number of clusters, suggest appropriate observation scales and provide corresponding cluster results.
Modeling the Dynamics of Online Learning Activity
Mavroforakis, Charalampos, Valera, Isabel, Rodriguez, Manuel Gomez
Learning has become an online activity - people routinely use a wide variety of online learning platforms, ranging from wikis and question answering (Q&A) sites to online communities and blogs, to learn about a large range of topics. In this context, people find solutions to their problems by looking for closely related pieces of information, executing a sequence of queries or, more generally, performing a series of online actions. For example, a high school student may study several closely related wiki pages to prepare an essay about a historical event; a software developer may read several answers within a Q&A site to solve a specific programming problem; and, a researcher may check a specialized blog written by one of her peers to learn about a new concept or technique. All the above are examples of learning patterns, in which people perform a series of online actions - reading a wiki page, an answer, or a blog - to achieve a predefined goal - writing an essay, solving a programming problem, or learning about a new concept or technique. In this context, one may expect that people with similar goals undertake similar sequences of online actions and thus adopt similar learning patterns. Therefore, one could leverage the vast availability of online traces of users' learning activity to disambiguate among interleaved learning patterns adopted by individuals over time, as well as to automatically identify and track those people's interests and goals over time. In this work, we introduce a novel probabilistic model, the Hierarchical Dirichlet Hawkes Process (HDHP), for clustering continuous-time grouped streaming data, which we use to uncover the dynamics of learning activity on the web. The HDHP leverages the properties of the Hierarchical Dirichlet Process (HDP) [18], a popular Bayesian nonparametric model for clustering problems involving multiple groups of data, combined with the Hawkes process [13], a temporal point process particularly well fitted to model social activity [11, 19, 20]. In particular, the former is used to account for an infinite number of learning patterns, which are shared across users (groups) of an online learning platform.
Identifying collusion groups using spectral clustering
Sarswat, Suneel, Abraham, Kandathil Mathew, Ghosh, Subir Kumar
In an illiquid stock, traders can collude and place orders on a predetermined price and quantity at a fixed schedule. This is usually done to manipulate the price of the stock or to create artificial liquidity in the stock, which may mislead genuine investors. Here, the problem is to identify such group of colluding traders. We modeled the problem instance as a graph, where each trader corresponds to a vertex of the graph and trade corresponds to edges of the graph. Further, we assign weights on edges depending on total volume, total number of trades, maximum change in the price and commonality between two vertices. Spectral clustering algorithms are used on the constructed graph to identify colluding group(s). We have compared our results with simulated data to show the effectiveness of spectral clustering to detecting colluding groups. Moreover, we also have used parameters of real data to test the effectiveness of our algorithm.
Probabilistic Dimensionality Reduction via Structure Learning
We propose a novel probabilistic dimensionality reduction framework that can naturally integrate the generative model and the locality information of data. Based on this framework, we present a new model, which is able to learn a smooth skeleton of embedding points in a low-dimensional space from high-dimensional noisy data. The formulation of the new model can be equivalently interpreted as two coupled learning problem, i.e., structure learning and the learning of projection matrix. This interpretation motivates the learning of the embedding points that can directly form an explicit graph structure. We develop a new method to learn the embedding points that form a spanning tree, which is further extended to obtain a discriminative and compact feature representation for clustering problems. Unlike traditional clustering methods, we assume that centers of clusters should be close to each other if they are connected in a learned graph, and other cluster centers should be distant. This can greatly facilitate data visualization and scientific discovery in downstream analysis. Extensive experiments are performed that demonstrate that the proposed framework is able to obtain discriminative feature representations, and correctly recover the intrinsic structures of various real-world datasets.
Galactic Exchange Announces Integrated AppStore for ClusterGX(TM), the 5-Minute Install Big Data Clustering Platform - insideBIGDATA
Galactic Exchange, Inc. announced the availability of an integrated AppStore embedded within the UI of its flagship ClusterGX product line. California based Galactic Exchange has a vision to hugely simplify the deployment of Hadoop/Spark based infrastructures and associated business intelligence and machine learning applications, both on-premise and in the cloud. The integrated AppStore represents Phase-2 in our singular vision of making the deployment and activation of Big Data infrastructure and applications available to every business," said Rob Mustarde, President and CEO of Galactic Exchange. "Our ClusterGX platform already makes deployment of Hadoop/Spark clusters easier than anything else out there by an order of magnitude. Now with the AppStore, a user can launch enterprise grade applications with a single click." When a user launches an application from the integrated AppStore, ClusterGX will automatically deploy the application across the cluster whilst configuring the vendor specified Hadoop settings. If there are additional dependencies required -- such as Kafka, Impala or others -- ClusterGX will automatically deploy those as well, with no further input required from the user. Rocana is one of the first vendors to partner with Galactic Exchange on the AppStore. The ClusterGX platform represents a step-change in Big Data simplification," said Eric Sammer, CTO and co-founder at Rocana.
Unsupervised clustering under the Union of Polyhedral Cones (UOPC) model
Wang, Wenqi, Aggarwal, Vaneet, Aeron, Shuchin
In this paper, we consider clustering data that is assumed to come from one of finitely many pointed convex polyhedral cones. This model is referred to as the Union of Polyhedral Cones (UOPC) model. Similar to the Union of Subspaces (UOS) model where each data from each subspace is generated from a (unknown) basis, in the UOPC model each data from each cone is assumed to be generated from a finite number of (unknown) \emph{extreme rays}.To cluster data under this model, we consider several algorithms - (a) Sparse Subspace Clustering by Non-negative constraints Lasso (NCL), (b) Least squares approximation (LSA), and (c) K-nearest neighbor (KNN) algorithm to arrive at affinity between data points. Spectral Clustering (SC) is then applied on the resulting affinity matrix to cluster data into different polyhedral cones. We show that on an average KNN outperforms both NCL and LSA and for this algorithm we provide the deterministic conditions for correct clustering. For an affinity measure between the cones it is shown that as long as the cones are not very coherent and as long as the density of data within each cone exceeds a threshold, KNN leads to accurate clustering. Finally, simulation results on real datasets (MNIST and YaleFace datasets) depict that the proposed algorithm works well on real data indicating the utility of the UOPC model and the proposed algorithm.