 Clustering


A Latent Gaussian Mixture Model for Clustering Longitudinal Data

arXiv.org Machine Learning

Finite mixture models have become a popular tool for clustering. Amongst other uses, they have been applied for clustering longitudinal data and clustering high-dimensional data. In the latter case, a latent Gaussian mixture model is sometimes used. Although there has been much work on clustering using latent variables and on clustering longitudinal data, respectively, there has been a paucity of work that combines these features. An approach is developed for clustering longitudinal data with many time points based on an extension of the mixture of common factor analyzers model. A variation of the expectation-maximization algorithm is used for parameter estimation and the Bayesian information criterion is used for model selection. The approach is illustrated using real and simulated data.
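The estimation and selection loop the abstract mentions (fit by expectation-maximization, then compare models by BIC) can be illustrated with a much simpler model than the paper's. The sketch below is an assumption-laden stand-in: it fits a plain univariate Gaussian mixture, not the mixture of common factor analyzers extension, and the names `normal_pdf`, `fit_gmm_1d`, and `bic_by_k` are hypothetical. It uses the convention BIC = -2 log L + p log n, where lower is better; some authors use the opposite sign.

```python
import math
import random


def normal_pdf(v, mean, var):
    """Density of a univariate normal distribution at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(x, k, n_iter=200):
    """Fit a k-component univariate Gaussian mixture by EM.

    Returns (log_likelihood, bic). Illustrative only: not the paper's model.
    """
    n = len(x)
    grand_mean = sum(x) / n
    grand_var = sum((v - grand_mean) ** 2 for v in x) / n
    xs = sorted(x)
    # Assumed initialisation: means at evenly spaced sample quantiles.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [grand_var / k] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for v in x:
            dens = [w * normal_pdf(v, m, s2)
                    for w, m, s2 in zip(weights, means, variances)]
            total = max(sum(dens), 1e-300)  # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an (effectively) empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, x)) / nj
            var_j = sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, x)) / nj
            variances[j] = max(var_j, 1e-6)  # floor to avoid collapse

    log_lik = sum(
        math.log(max(sum(w * normal_pdf(v, m, s2)
                         for w, m, s2 in zip(weights, means, variances)), 1e-300))
        for v in x
    )
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    bic = -2.0 * log_lik + n_params * math.log(n)
    return log_lik, bic


# Synthetic data: two well-separated groups (made up for this sketch).
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(8.0, 1.0) for _ in range(100)]
bic_by_k = {k: fit_gmm_1d(data, k)[1] for k in (1, 2, 3)}
best_k = min(bic_by_k, key=bic_by_k.get)  # candidate with the lowest BIC
```

On data like this, the two-component fit should score a clearly lower BIC than the one-component fit, because the gain in log-likelihood outweighs the penalty for the extra parameters.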


Latent Geometry Inspired Graph Dissimilarities Enhance Affinity Propagation Community Detection in Complex Networks

arXiv.org Machine Learning

Affinity propagation is one of the most effective algorithms for data clustering in high-dimensional feature space. However, numerous attempts to test its performance for community detection in real complex networks have produced results far from those of state-of-the-art methods such as Infomap and Louvain. Yet all these studies agreed that the crucial problem is to convert the network topology into a 'smart-enough' dissimilarity matrix that is able to properly address the message passing procedure behind affinity propagation clustering. Here we discuss how to leverage network latent geometry notions in order to design dissimilarity matrices for affinity propagation community detection. Our results demonstrate that the dissimilarity measures we designed enable affinity propagation to outperform current state-of-the-art methods for community detection, not only on several original real networks, but also when their structure is corrupted by noise artificially induced by missing or spurious connectivity.


9 Off-the-beaten-path Statistical Science Topics with Interesting Applications

@machinelearnbot

You will find here nine interesting topics that you won't learn in college classes. Most have interesting applications in business and elsewhere. They are not especially difficult, and I explain them in simple English. Yet they are not part of the traditional statistical curriculum, and even many experienced data scientists with a PhD degree have not heard about some of these concepts. This is a well known model, used as a base stochastic process to model the logarithm of stock prices, yet it has interesting properties (depending on dimension) that few people know about.


Rademacher Complexity Bounds for a Penalized Multi-class Semi-supervised Algorithm

Journal of Artificial Intelligence Research

We propose Rademacher complexity bounds for multi-class classifiers trained with a two-step semi-supervised model. In the first step, the algorithm partitions the partially labeled data and then uses the labeled training examples to identify dense clusters containing κ predominant classes, such that the proportion of their non-predominant classes is below a fixed threshold that stands for clustering consistency. In the second step, a classifier is trained by minimizing a margin empirical loss over the labeled training set and a penalization term measuring the inability of the learner to predict the κ predominant classes of the identified clusters. The resulting data-dependent generalization error bound involves the margin distribution of the classifier, the stability of the clustering technique used in the first step, and Rademacher complexity terms corresponding to partially labeled training data. Our theoretical results exhibit convergence rates extending those proposed in the literature for the binary case, and experimental results on different multi-class classification problems show empirical evidence that supports the theory.


Dynamic Multivariate Functional Data Modeling via Sparse Subspace Learning

arXiv.org Machine Learning

Multivariate functional data from a complex system are naturally high-dimensional and have a complex cross-correlation structure. This complexity shows in two ways: (1) some functions are strongly correlated and share similar features, while others have almost no cross-correlation and quite diverse features; and (2) the cross-correlation structure may also change over time as the system evolves. In this regard, this paper presents a dynamic subspace learning method for multivariate functional data modeling. In particular, we assume that different functions come from different subspaces, and that only functions of the same subspace are cross-correlated with each other. The subspaces can be automatically formulated and learned by reformulating the problem as a sparse regression. By allowing the regression to change over time while regularizing that change, we can describe the cross-correlation dynamics. The model can be efficiently estimated by the fast iterative shrinkage-thresholding algorithm (FISTA), and the features of every subspace can be extracted using the smooth multi-channel functional PCA. Numerical studies together with case studies demonstrate the efficiency and applicability of the proposed methodology.
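To make the FISTA step concrete, here is a minimal sketch on the simplest sparse regression, the lasso, which minimizes 0.5 * ||Ax - b||^2 + lam * ||x||_1. This is not the paper's objective: it has no subspace structure and no penalty on change over time. The function names, the step size taken from the squared Frobenius norm, and the toy data are all assumptions made for illustration.

```python
import math


def soft_threshold(v, tau):
    """Proximal operator of tau * |.|: shrink v toward zero by tau."""
    return math.copysign(max(abs(v) - tau, 0.0), v)


def fista_lasso(A, b, lam, n_iter=500):
    """Minimise 0.5 * ||A x - b||^2 + lam * ||x||_1 with FISTA.

    A is a list of rows, b a list of targets. Illustrative sketch only.
    """
    n_rows, n_cols = len(A), len(A[0])
    # The squared Frobenius norm is an upper bound on the largest eigenvalue
    # of A^T A, so 1 / L is a safe (if conservative) step size.
    L = sum(a * a for row in A for a in row)
    x = [0.0] * n_cols
    y = x[:]          # extrapolated point used for the gradient step
    t = 1.0           # momentum parameter
    for _ in range(n_iter):
        residual = [sum(A[i][j] * y[j] for j in range(n_cols)) - b[i]
                    for i in range(n_rows)]
        grad = [sum(A[i][j] * residual[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Gradient step on the smooth part, then shrinkage (the "ST" in FISTA).
        x_new = [soft_threshold(y[j] - grad[j] / L, lam / L) for j in range(n_cols)]
        # Momentum update that distinguishes FISTA from plain ISTA.
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = [x_new[j] + (t - 1.0) / t_new * (x_new[j] - x[j]) for j in range(n_cols)]
        x, t = x_new, t_new
    return x


# Toy problem: with an identity design matrix, the lasso solution is the
# soft-thresholded target, which makes the result easy to check by hand.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [3.0, -0.5, 1.5]
coef = fista_lasso(A, b, lam=1.0)
```

For this toy problem the coefficients should approach [2.0, 0.0, 0.5]: the middle coefficient is driven to zero because its target is smaller in magnitude than the penalty, which is the sparsity effect the abstract relies on.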


Clustering Based Unsupervised Learning – Towards Data Science

#artificialintelligence

Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from "unlabeled" data (a classification or categorization is not included in the observations). While there is an extensive list of clustering algorithms available (whether you use R or Python's Scikit-Learn), I will attempt to cover the basic concepts. The most common and simplest clustering algorithm out there is K-Means clustering. This algorithm requires you to tell it how many clusters (K) there are in the dataset. The algorithm then iteratively moves the K centers and assigns each data point to the cluster whose centroid is closest.
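The two alternating steps described above, assigning points to the nearest centroid and then moving each centroid, can be sketched in a few lines of plain Python. This is a minimal illustration rather than the Scikit-Learn implementation; the function names, the random initialization from data points, and the toy data are assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster points (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # start from k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break                     # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Toy data: two small, well-separated groups (made up for this sketch).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centers, labels = kmeans(points, k=2)
```

With data this cleanly separated, the first three points should end up in one cluster and the last three in the other. On real data the result can depend on the initial centers, which is why library implementations typically run several restarts.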


Unsupervised Learning of Mixture Models with a Uniform Background Component

arXiv.org Machine Learning

Gaussian Mixture Models are one of the most studied and mature models in unsupervised learning. However, outliers are often present in the data and could influence the cluster estimation. In this paper, we study a new model that assumes that data comes from a mixture of a number of Gaussians as well as a uniform "background" component assumed to contain outliers and other non-interesting observations. We develop a novel method based on robust loss minimization that performs well in clustering such GMMs with a uniform background. We give theoretical guarantees that our clustering algorithm obtains the best clustering results with high probability. In addition, we show that the result of our algorithm does not depend on initialization or local optima, and that parameter tuning is an easy task. Through numerical simulations, we demonstrate that our algorithm enjoys high accuracy and achieves the best clustering results given a large enough sample size.


Supervised vs. Unsupervised Learning

#artificialintelligence

Within the field of machine learning, there are two main types of tasks: supervised, and unsupervised. The main difference between the two types is that supervised learning is done using a ground truth, or in other words, we have prior knowledge of what the output values for our samples should be. Therefore, the goal of supervised learning is to learn a function that, given a sample of data and desired outputs, best approximates the relationship between input and output observable in the data. Unsupervised learning, on the other hand, does not have labeled outputs, so its goal is to infer the natural structure present within a set of data points. Supervised learning is typically done in the context of classification, when we want to map input to output labels, or regression, when we want to map input to a continuous output.


MOG: Mapper on Graphs for Relationship Preserving Clustering

arXiv.org Machine Learning

The interconnected nature of graphs often results in difficult-to-interpret clutter. Typically, techniques focus on either decluttering by clustering nodes with similar properties or grouping edges with similar relationships. We propose using mapper, a powerful topological data analysis tool, to summarize the structure of a graph in a way that both clusters data with similar properties and preserves relationships. Typically, mapper operates on a given data set by utilizing a scalar function defined on every point in the data and a cover for the codomain of the scalar function. The output of mapper is a graph that summarizes the shape of the space. In this paper, we outline how to use this mapper construction on an input graph, outline three filter functions that capture important structures of the input graph, and provide an interface for interactively modifying the cover. To validate our approach, we conduct several case studies on synthetic and real world data sets and demonstrate how our method can give meaningful summaries for graphs with various complexities.


Vanlearning: A Machine Learning SaaS Application for People Without Programming Backgrounds

arXiv.org Machine Learning

Although we have tons of machine learning tools to analyze data, most of them require users to have some programming background. Here we introduce a SaaS application which allows users to analyze their data without any coding and even without any knowledge of machine learning. Users can upload, train, predict and download their data by simply clicking their mouse. Our system uses a data pre-processor and validator to reduce the computational cost on our server. The simple architecture of Vanlearning helps developers easily maintain and extend it.