Clustering
Learning From Hidden Traits: Joint Factor Analysis and Latent Clustering
Yang, Bo, Fu, Xiao, Sidiropoulos, Nicholas D.
Dimensionality reduction techniques play an essential role in data analytics, signal processing and machine learning. Dimensionality reduction is usually performed in a preprocessing stage that is separate from subsequent data analysis, such as clustering or classification. Finding reduced-dimension representations that are well-suited for the intended task is more appealing. This paper proposes a joint factor analysis and latent clustering framework, which aims at learning cluster-aware low-dimensional representations of matrix and tensor data. The proposed approach leverages matrix and tensor factorization models that produce essentially unique latent representations of the data to unravel latent cluster structure -- which is otherwise obscured because of the freedom to apply an oblique transformation in latent space. At the same time, latent cluster structure is used as prior information to enhance the performance of factorization. Specific contributions include several custom-built problem formulations, corresponding algorithms, and discussion of associated convergence properties. Besides extensive simulations, real-world datasets such as Reuters document data and MNIST image data are also employed to showcase the effectiveness of the proposed approaches.
Clustering Algorithms: From Start To State Of The Art
It's not a bad time to be a Data Scientist. Serious people may find interest in you if you turn the conversation towards "Big Data", and the rest of the party crowd will be intrigued when you mention "Artificial Intelligence" and "Machine Learning". Even Google thinks you're not bad, and that you're getting even better. There are a lot of'smart' algorithms that help data scientists do their wizardry. It may all seem complicated, but if we understand and organize algorithms a bit, it's not even that hard to find and apply the one that we need. Courses on data mining or machine learning will usually start with clustering, because it is both simple and useful. It is an important part of a somewhat wider area of Unsupervised Learning, where the data we want to describe is not labeled.
Geometry of Interest (GOI): Spatio-Temporal Destination Extraction and Partitioning in GPS Trajectory Data
Mousavi, Seyed Morteza, Harwood, Aaron, Karunasekera, Shanika, Maghrebi, Mojtaba
Noname manuscript No. (will be inserted by the editor) Abstract Nowadays large amounts of GPS trajectory data is being continuously collected by GPSenabled devices such as vehicles navigation systems and mobile phones. GPS trajectory data is useful for applications such as traffic management, location forecasting, and itinerary planning. Such applications often need to extract the time-stamped Sequence of Visited Locations (SVLs) of the mobile objects. The nearest neighbor query (NNQ) is the most applied method for labeling the visited locations based on the IDs of the POIs in the process of SVL generation. NNQ in some scenarios is not accurate enough. To improve the quality of the extracted SVLs, instead of using NNQ, we label the visited locations as the IDs of the POIs which geometrically intersect with the GPS observations. In this paper we propose a novel method for estimating the POIs and their GOIs, which consists of three phases: (i) extracting the geometries of the stay regions; (ii) constructing the geometry of destination regions based on the extracted stay regions; and (iii) constructing the GOIs based on the geometries of the destination regions. Using the geometric similarity to known GOIs as the major evaluation criterion, the experiments we performed using long-term GPS trajectory data show that our method outperforms the existing approaches. Keywords Trajectory Data, Spatio-Temporal Partitioning, Geometry of Interest, Time-Value, Time-Weighted Centroid, Destination Extraction 1 Introduction In recent years, GPS trajectory data has become abundant due to the many GPS enabled devices used on a daily basis. Mining these GPS trajectories for gathering useful information for applications has received a growing amount of attention in the recent literature. In this field, researchers have tried to derive knowledge for solving practical problems (e.g. The applications dealing with data analysis on trajectory data often need to have access to information about the significant places which a mobile object frequently travels and stay. These significant places are referred to as the points of interest (POIs).
Wisdom of Crowds cluster ensemble
Alizadeh, Hosein, Yousefnezhad, Muhammad, Bidgoli, Behrouz Minaei
The Wisdom of Crowds is a phenomenon described in social science that suggests four criteria applicable to groups of people. It is claimed that, if these criteria are satisfied, then the aggregate decisions made by a group will often be better than those of its individual members. Inspired by this concept, we present a novel feedback framework for the cluster ensemble problem, which we call Wisdom of Crowds Cluster Ensemble (WOCCE). Although many conventional cluster ensemble methods focusing on diversity have recently been proposed, WOCCE analyzes the conditions necessary for a crowd to exhibit this collective wisdom. These include decentralization criteria for generating primary results, independence criteria for the base algorithms, and diversity criteria for the ensemble members. We suggest appropriate procedures for evaluating these measures, and propose a new measure to assess the diversity. We evaluate the performance of WOCCE against some other traditional base algorithms as well as state-of-the-art ensemble methods. The results demonstrate the efficiency of WOCCE's aggregate decision-making compared to other algorithms.
Unsupervised Semantic Action Discovery from Video Collections
Sener, Ozan, Zamir, Amir Roshan, Wu, Chenxia, Savarese, Silvio, Saxena, Ashutosh
Human communication takes many forms, including speech, text and instructional videos. It typically has an underlying structure, with a starting point, ending, and certain objective steps between them. In this paper, we consider instructional videos where there are tens of millions of them on the Internet. We propose a method for parsing a video into such semantic steps in an unsupervised way. Our method is capable of providing a semantic "storyline" of the video composed of its objective steps. We accomplish this using both visual and language cues in a joint generative model. Our method can also provide a textual description for each of the identified semantic steps and video segments. We evaluate our method on a large number of complex YouTube videos and show that our method discovers semantically correct instructions for a variety of tasks.
An efficient K-means algorithm for Massive Data
Capรณ, Marco, Pรฉrez, Aritz, Lozano, Josรฉ Antonio
Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to ma- nipulate and analyze such information. Even though datasets have grown in size, the K-means algorithm remains as one of the most popular clustering methods, in spite of its dependency on the initial settings and high computational cost, especially in terms of distance computations. In this work, we propose an efficient approximation to the K-means problem intended for massive data. Our approach recursively partitions the entire dataset into a small number of sub- sets, each of which is characterized by its representative (center of mass) and weight (cardinality), afterwards a weighted version of the K-means algorithm is applied over such local representation, which can drastically reduce the number of distances computed. In addition to some theoretical properties, experimental results indicate that our method outperforms well-known approaches, such as the K-means++ and the minibatch K-means, in terms of the relation between number of distance computations and the quality of the approximation.
Clustering subgaussian mixtures by semidefinite programming
Mixon, Dustin G., Villar, Soledad, Ward, Rachel
We introduce a model-free relax-and-round algorithm for k-means clustering based on a semidefinite relaxation due to Peng and Wei. The algorithm interprets the SDP output as a denoised version of the original data and then rounds this output to a hard clustering. We provide a generic method for proving performance guarantees for this algorithm, and we analyze the algorithm in the context of subgaussian mixture models. We also study the fundamental limits of estimating Gaussian centers by k-means clustering in order to compare our approximation guarantee to the theoretically optimal k-means clustering solution.
Oracle Based Active Set Algorithm for Scalable Elastic Net Subspace Clustering
You, Chong, Li, Chun-Guang, Robinson, Daniel P., Vidal, Rene
State-of-the-art subspace clustering methods are based on expressing each data point as a linear combination of other data points while regularizing the matrix of coefficients with $\ell_1$, $\ell_2$ or nuclear norms. $\ell_1$ regularization is guaranteed to give a subspace-preserving affinity (i.e., there are no connections between points from different subspaces) under broad theoretical conditions, but the clusters may not be connected. $\ell_2$ and nuclear norm regularization often improve connectivity, but give a subspace-preserving affinity only for independent subspaces. Mixed $\ell_1$, $\ell_2$ and nuclear norm regularizations offer a balance between the subspace-preserving and connectedness properties, but this comes at the cost of increased computational complexity. This paper studies the geometry of the elastic net regularizer (a mixture of the $\ell_1$ and $\ell_2$ norms) and uses it to derive a provably correct and scalable active set method for finding the optimal coefficients. Our geometric analysis also provides a theoretical justification and a geometric interpretation for the balance between the connectedness (due to $\ell_2$ regularization) and subspace-preserving (due to $\ell_1$ regularization) properties for elastic net subspace clustering. Our experiments show that the proposed active set method not only achieves state-of-the-art clustering performance, but also efficiently handles large-scale datasets.
Semi-Unsupervised Clustering Using Reinforcement Learning
Bose, Sourabh (The University of Texas at Arlington) | Huber, Manfred (The University of Texas at Arlington)
Clusters defined over a dataset by unsupervised clustering often present groupings which differ from the expected solution. This is primarily the case when some scarce knowledge of the problem exists beforehand that partially identifies desired characteristics of clusters. However conventional clustering algorithms are not defined to expect any supervision from the external world, as they are supposed to be completely unsupervised. As a result they can not benefit or effectively take into account available information about the use or properties of the clusters. In this paper we propose a reinforcement learning approach to address this problem where existing, unmodified unsupervised clustering algorithms are augmented in a way that the available sparse information is utilized to achieve more appropriate clusters. Our model works with any clustering algorithm, but the input to the algorithm, instead of being the original dataset, is a scaled version of the same, where the scaling factors are determined by the reinforcement learning algorithm.
Text Analysis 101; A Basic Understanding for Business Users: Clustering and Unsupervised Methods
This blog was originally posted as part of our Text Analysis 101 blog series. It aims to explain how the classification of text works as part of Natural Language Processing. It was the second blog on harnessing Machine Learning (ML) in the form of Natural Language Processing (NLP) for the Automatic Classification of documents. By classifying text, we aim to assign a document or piece of text to one or more classes or categories making it easier to manage or sort. A Document Classifier often returns or assigns a category "label" or "code" to a document or piece of text.