 Clustering


Scalable Sparse Subspace Clustering by Orthogonal Matching Pursuit

arXiv.org Machine Learning

Subspace clustering methods based on $\ell_1$, $\ell_2$ or nuclear norm regularization have become very popular due to their simplicity, theoretical guarantees and empirical success. However, the choice of the regularizer can greatly impact both theory and practice. For instance, $\ell_1$ regularization is guaranteed to give a subspace-preserving affinity (i.e., there are no connections between points from different subspaces) under broad conditions (e.g., arbitrary subspaces and corrupted data). However, it requires solving a large-scale convex optimization problem. On the other hand, $\ell_2$ and nuclear norm regularization provide efficient closed-form solutions, but require very strong assumptions to guarantee a subspace-preserving affinity, e.g., independent subspaces and uncorrupted data. In this paper we study a subspace clustering method based on orthogonal matching pursuit. We show that the method is both computationally efficient and guaranteed to give a subspace-preserving affinity under broad conditions. Experiments on synthetic data verify our theoretical analysis, and applications in handwritten digit and face clustering show that our approach achieves the best trade-off between accuracy and efficiency.
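To make the idea of a subspace-preserving affinity concrete, here is a minimal sketch of self-expression by orthogonal matching pursuit, written with NumPy. It is not the paper's implementation: the function names, the `k_max` and `tol` parameters, the unit-norm columns, and the symmetrization `|C| + |C|^T` are all assumptions made for illustration.

```python
import numpy as np

def omp_self_expression(X, k_max, tol=1e-6):
    """Express each column of X as a sparse combination of the other columns.

    X is (d, n) with one data point per column (assumed unit-norm here).
    Returns an (n, n) coefficient matrix C with a zero diagonal.
    Hypothetical helper for illustration only.
    """
    d, n = X.shape
    C = np.zeros((n, n))
    for j in range(n):
        residual = X[:, j].copy()
        support = []
        for _ in range(k_max):
            # Greedy step: pick the point most correlated with the residual.
            corr = np.abs(X.T @ residual)
            corr[j] = -1.0              # a point may not represent itself
            if support:
                corr[support] = -1.0    # do not pick the same point twice
            support.append(int(np.argmax(corr)))
            # Orthogonal step: refit all selected coefficients by least squares.
            coef, *_ = np.linalg.lstsq(X[:, support], X[:, j], rcond=None)
            residual = X[:, j] - X[:, support] @ coef
            if np.linalg.norm(residual) < tol:
                break
        C[support, j] = coef
    return C

def affinity_from_coefficients(C):
    """Symmetric affinity |C| + |C|^T (an assumed, commonly used convention)."""
    A = np.abs(C)
    return A + A.T

# Toy data: four points on a plane spanned by e1, e2 and four on the
# orthogonal plane spanned by e3, e4.
X_demo = np.zeros((4, 8))
for col, angle in enumerate([0.1, 0.7, 1.3, 2.0]):
    X_demo[0, col], X_demo[1, col] = np.cos(angle), np.sin(angle)
for col, angle in enumerate([0.2, 0.9, 1.5, 2.4]):
    X_demo[2, 4 + col], X_demo[3, 4 + col] = np.cos(angle), np.sin(angle)

C_demo = omp_self_expression(X_demo, k_max=2)
W_demo = affinity_from_coefficients(C_demo)
```

On this toy data the two off-diagonal blocks of `W_demo` are zero, which is what "no connections between points from different subspaces" means. Feeding such an affinity to a spectral clustering step is the usual next stage, but that step is left out of the sketch.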


Clustering on the Edge: Learning Structure in Graphs

arXiv.org Machine Learning

With the recent popularity of graphical clustering methods, there has been an increased focus on the information between samples. We show how learning cluster structure using edge features naturally and simultaneously determines the most likely number of clusters and addresses data scale issues. These results are particularly useful in instances where (a) there are a large number of clusters and (b) we have some labeled edges. Applications in this domain include image segmentation, community discovery and entity resolution. Our model is an extension of the planted partition model and our solution uses results of correlation clustering, which achieves a partition O(log(n))-close to the log-likelihood of the true clustering.


k-means clustering

@machinelearnbot

If there is not an operational definition for the number of clusters, yes, you have to figure this out yourself. You can use an algorithm to figure it out, but how do you know the algorithm is trading off the number of clusters vs. compactness the way you want? You have to have some idea of what you want, of course, but usually in my consulting engagements where k was unknown we would do the following. First, we compute the mean values of all the input variables to get the gist of where the clusters are centered. You can compute the mean values of every variable in the clusters, but it could be that all the variables except one have the same mean for every cluster--it's just one variable that is really responsible for driving the formation of the clusters.
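A rough sketch of that check, in plain Python: compute the mean of every variable within each cluster, then look at how far those means spread across clusters. The helper names `cluster_means` and `mean_spread` and the toy data are made up for illustration, and the labels are assumed to come from whatever clustering you already ran.

```python
def cluster_means(rows, labels):
    """Mean of every variable within each cluster.

    rows is a list of equal-length numeric lists, labels a parallel list
    of cluster ids. Returns {cluster id: list of per-variable means}.
    """
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        for i, value in enumerate(row):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def mean_spread(means):
    """Range (max minus min) of the cluster means, one value per variable.

    A spread near zero suggests the variable barely distinguishes clusters.
    """
    per_variable = list(zip(*means.values()))
    return [max(vals) - min(vals) for vals in per_variable]

# Toy example: the first variable is identical everywhere, the second
# is the one that separates the two clusters.
rows = [[1.0, 10.0], [1.0, 20.0], [1.0, 30.0], [1.0, 40.0]]
labels = [0, 0, 1, 1]
means = cluster_means(rows, labels)
spread = mean_spread(means)
```

Here `spread` is zero for the first variable and large for the second, which is the "just one variable is driving the clusters" situation described above. In practice you would want to standardize the variables first so the spreads are comparable.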


Classical Statistics and Statistical Learning in Imaging Neuroscience

arXiv.org Machine Learning

Neuroimaging research has predominantly drawn conclusions based on classical statistics, including null-hypothesis testing, t-tests, and ANOVA. In recent years, statistical learning methods have enjoyed increasing popularity, including cross-validation, pattern classification, and sparsity-inducing regression. These two methodological families used for neuroimaging data analysis can be viewed as two extremes of a continuum. Yet, they originated from different historical contexts, build on different theories, rest on different assumptions, evaluate different outcome metrics, and permit different conclusions. This paper portrays commonalities and differences between classical statistics and statistical learning in relation to neuroimaging research. The conceptual implications are illustrated in three common analysis scenarios. The paper thus attempts to resolve possible confusion between classical hypothesis testing and data-guided model estimation by discussing their ramifications for the neuroimaging access to neurobiology.


Ethnicity sensitive author disambiguation using semi-supervised learning

arXiv.org Machine Learning

Author name disambiguation in bibliographic databases is the problem of grouping together scientific publications written by the same person, accounting for potential homonyms and/or synonyms. Among solutions to this problem, digital libraries are increasingly offering tools for authors to manually curate their publications and claim those that are theirs. Indirectly, these tools allow for the inexpensive collection of large annotated training data, which can be further leveraged to build a complementary automated disambiguation system capable of inferring patterns for identifying publications written by the same person. Building on more than 1 million publicly released crowdsourced annotations, we propose an automated author disambiguation solution exploiting this data (i) to learn an accurate classifier for identifying coreferring authors and (ii) to guide the clustering of scientific publications by distinct authors in a semi-supervised way. To the best of our knowledge, our analysis is the first to be carried out on data of this size and coverage. With respect to the state of the art, we validate the general pipeline used in most existing solutions, and improve by: (i) proposing phonetic-based blocking strategies, thereby increasing recall; and (ii) adding strong ethnicity-sensitive features for learning a linkage function, thereby tailoring disambiguation to non-Western author names whenever necessary.


The Hidden Convexity of Spectral Clustering

arXiv.org Machine Learning

Partitioning a dataset into classes based on a similarity between data points, known as cluster analysis, is one of the most basic and practically important problems in data analysis and machine learning. It has a vast array of applications from speech recognition to image analysis to bioinformatics and to data compression. There is an extensive literature on the subject, including a number of different methodologies as well as their various practical and theoretical aspects [11]. In recent years spectral clustering--a class of methods based on the eigenvectors of a certain matrix, typically the graph Laplacian constructed from data--has become a widely used method for cluster analysis. This is due to the simplicity of the algorithm, a number of desirable properties it exhibits and its amenability to theoretical analysis. In its simplest form, spectral bi-partitioning is an attractively straightforward algorithm based on thresholding the second bottom eigenvector of the Laplacian matrix of a graph. However, the more practically significant problem of multiway spectral clustering is considerably more complex. While hierarchical methods based on a sequence of binary splits have been used, the most common approaches use k-means or weighted k-means clustering in the spectral space or related iterative procedures [17, 15, 2, 25].
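The simplest case mentioned above, spectral bi-partitioning, can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's method: it uses the unnormalized Laplacian L = D - W, thresholds the second-smallest eigenvector at zero, and the function name and toy graph are hypothetical.

```python
import numpy as np

def spectral_bipartition(W):
    """Split a graph in two by thresholding the second-smallest Laplacian eigenvector.

    W is a symmetric (n, n) matrix of nonnegative edge weights for a
    connected graph. Returns a boolean array marking one side of the cut.
    The zero threshold is an assumption; other thresholds are possible.
    """
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)
    laplacian = np.diag(degrees) - W
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so column 1 is the "second bottom" eigenvector.
    _, eigvecs = np.linalg.eigh(laplacian)
    second_bottom = eigvecs[:, 1]
    return second_bottom >= 0.0

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by one weak edge.
W_demo = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W_demo[a, b] = W_demo[b, a] = 1.0
W_demo[2, 3] = W_demo[3, 2] = 0.1

side = spectral_bipartition(W_demo)
```

The sign of an eigenvector is arbitrary, so which triangle ends up marked `True` can vary; only the grouping is meaningful. The multiway case discussed in the text replaces the single threshold with a clustering step over several eigenvectors.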


Temporal Clustering of Time Series via Threshold Autoregressive Models: Application to Commodity Prices

arXiv.org Machine Learning

This study aimed to find temporal clusters for several commodity prices using the threshold nonlinear autoregressive model. It is expected that the process of determining the commodity groups that are time-dependent will advance the current knowledge about the dynamics of co-moving and coherent prices, and can serve as a basis for multivariate time series analyses. The clustering of commodity prices was examined using the proposed clustering approach based on time series models to incorporate the time varying properties of price series into the clustering scheme. Accordingly, the primary aim in this study was grouping time series according to the similarity between their Data Generating Mechanisms (DGMs) rather than comparing pattern similarities in the time series traces. The approximation to the DGM of each series was accomplished using threshold autoregressive models, which are recognized for their ability to represent nonlinear features in time series, such as abrupt changes, time-irreversibility and regime-shifting behavior. Through the use of the proposed approach, one can determine and monitor the set of co-moving time series variables across the time dimension. Furthermore, generating a time varying commodity price index and sub-indexes can become possible. Consequently, we conducted a simulation study to assess the effectiveness of the proposed clustering approach and the results are presented for both the simulated and real data sets.

Keywords: Clustering Nonlinear Time Series Models, Regime Switching, Spectral

1. Introduction

The movement of commodity prices and the associated dynamics are interrelated with economics and directly affect many industries.


Notebook on nbviewer

#artificialintelligence

There are a lot of clustering algorithms to choose from. The standard sklearn clustering suite has thirteen different clustering classes alone. So what clustering algorithms should you be using? As with every question in data science and machine learning, it depends on your data. A number of those thirteen classes in sklearn are specialised for certain tasks (such as co-clustering and bi-clustering, or clustering features instead of data points).


Examples -- scikit-learn 0.17.1 documentation

#artificialintelligence

This documentation is for scikit-learn version 0.17.1. Applications to real world problems with some medium sized datasets or interactive user interface. Examples illustrating the calibration of predicted probabilities of classifiers. Examples concerning model selection, mostly contained in the sklearn.grid_search


Graph Clustering Bandits for Recommendation

arXiv.org Machine Learning

Bandits are becoming an essential tool in modern recommender systems [9, 12]. Most recommendation settings involve an ever-changing, dynamic set of items. In many domains, such as news and ads recommendation, the item set is changing so rapidly that it is impossible to use standard collaborative filtering techniques. In these settings bandit algorithms such as contextual bandits have been proven to work well [10], since they provide a principled way to gauge the appeal of the new items. Yet, one drawback of contextual bandits is that they mainly work in a content-dependent regime: the user and item content features determine the preference scores, so that any collaborative effects (joint user preferences over groups of items) that arise are ignored. Incorporating collaborative effects into bandit algorithms can lead to a dramatic increase in the quality of recommendations. In bandit algorithms this has mainly been done by clustering the users. For instance, we may want to serve content to a group of users by taking advantage of an underlying network of preference relationships among them. These preference relationships can either be explicitly encoded in a graph, where adjacent nodes/users are deemed similar to one another, or implicitly contained in the data, and given as the outcome of an inference process that recognizes similarities across users based on their past behavior. To deal with this issue, a new type of bandit algorithm has been developed that works under the assumption that users can be grouped (or clustered) based on their selection of items e.g.