AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Finding Singular Features

Genovese, Christopher, Perone-Pacifico, Marco, Verdinelli, Isabella, Wasserman, Larry

arXiv.org Machine Learning, Jun-1-2016

We present a method for finding high density, low-dimensional structures in noisy point clouds. These structures are sets with zero Lebesgue measure with respect to the $D$-dimensional ambient space and belong to a $d

artificial intelligence, machine learning, singular feature, (18 more...)

1606.00265

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)

Collaborative Filtering Bandits

Li, Shuai, Karatzoglou, Alexandros, Gentile, Claudio

arXiv.org Machine Learning, May-31-2016

Classical collaborative filtering and content-based filtering methods try to learn a static recommendation model given training data. These approaches are far from ideal in highly dynamic recommendation domains such as news recommendation and computational advertisement, where the set of items and users is very fluid. In this work, we investigate an adaptive clustering technique for content recommendation based on exploration-exploitation strategies in contextual multi-armed bandit settings. Our algorithm takes into account the collaborative effects that arise due to the interaction of the users with the items, by dynamically grouping users based on the items under consideration and, at the same time, grouping items based on the similarity of the clusterings induced over the users. The resulting algorithm thus takes advantage of preference patterns in the data in a way akin to collaborative filtering methods. We provide an empirical analysis on medium-size real-world datasets, showing scalability and increased prediction performance (as measured by click-through rate) over state-of-the-art methods for clustering bandits. We also provide a regret analysis within a standard linear stochastic noise setting.

algorithm, social media, upstream oil & gas, (22 more...)

1502.03473

Country:

North America > United States (0.28)
Europe > Italy (0.14)

Genre: Research Report > New Finding (0.93)

Industry:

Marketing (0.88)
Energy > Oil & Gas > Upstream (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Unsupervised Discovery of El Nino Using Causal Feature Learning on Microlevel Climate Data

Chalupka, Krzysztof, Bischoff, Tobias, Perona, Pietro, Eberhardt, Frederick

arXiv.org Machine Learning, May-30-2016

We show that the climate phenomena of El Nino and La Nina arise naturally as states of macro-variables when our recent causal feature learning framework (Chalupka 2015, Chalupka 2016) is applied to micro-level measures of zonal wind (ZW) and sea surface temperatures (SST) taken over the equatorial band of the Pacific Ocean. The method identifies these unusual climate states on the basis of the relation between ZW and SST patterns without any input about past occurrences of El Nino or La Nina. The simpler alternatives of (i) clustering the SST fields while disregarding their relationship with ZW patterns, or (ii) clustering the joint ZW-SST patterns, do not discover El Nino. We discuss the degree to which our method supports a causal interpretation and use a low-dimensional toy example to explain its success over other clustering approaches. Finally, we propose a new robust and scalable alternative to our original algorithm (Chalupka 2016), which circumvents the need for high-dimensional density learning.

artificial intelligence, machine learning, precision, (17 more...)

1605.0937

Country:

North America > United States (0.28)
Asia (0.28)
Oceania (0.28)
Pacific Ocean (0.24)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Stagewise Learning for Sparse Clustering of Discretely-Valued Data

Zhao, Vincent, Zucker, Steven W.

arXiv.org Machine Learning, May-27-2016

We study the model-based sparse clustering problem for discrete data using a mixture model of product distributions [9, 7]. This model has application in many fields, including computational neurosciences, crowdsourcing and bioinformatics, and is interesting because it differs technically from the problem for continuous data, where the well-known Gaussian mixture model has been applied successfully. A fundamental difficulty is that, in high-dimensional datasets, some features can be noisy, redundant or generally uninformative for clustering, and these can push clustering algorithms toward inappropriate or uninteresting results. If these uninformative or noise data points could be eliminated then, we argue, the results should be much more satisfying. This is precisely our goal: to find an informative set of data points and to use these to drive the clustering.

algorithm, artificial intelligence, machine learning, (17 more...)

1506.02975

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.86)

Combinatorial Topic Models using Small-Variance Asymptotics

Jiang, Ke, Sra, Suvrit, Kulis, Brian

arXiv.org Machine Learning, May-26-2016

Topic models have emerged as fundamental tools in unsupervised machine learning. Most modern topic modeling algorithms take a probabilistic view and derive inference algorithms based on Latent Dirichlet Allocation (LDA) or its variants. In contrast, we study topic modeling as a combinatorial optimization problem, and propose a new objective function derived from LDA by passing to the small-variance limit. We minimize the derived objective by using ideas from combinatorial optimization, which results in a new, fast, and high-quality topic modeling algorithm. In particular, we show that our results are competitive with popular LDA-based topic modeling approaches, and also discuss the (dis)similarities between our approach and its probabilistic counterparts.

artificial intelligence, machine learning, natural language, (17 more...)

1604.02027

Country: North America > United States (0.46)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.87)
(4 more...)

Compressive Spectral Clustering

Tremblay, Nicolas, Puy, Gilles, Gribonval, Remi, Vandergheynst, Pierre

arXiv.org Machine Learning, May-23-2016

Spectral clustering has become a popular technique due to its high performance in many contexts. It comprises three main steps: create a similarity graph between N objects to cluster, compute the first k eigenvectors of its Laplacian matrix to define a feature vector for each object, and run k-means on these features to separate objects into k classes. Each of these three steps becomes computationally intensive for large N and/or k. We propose to speed up the last two steps based on recent results in the emerging field of graph signal processing: graph filtering of random signals, and random sampling of bandlimited graph signals. We prove that our method, with a gain in computation time that can reach several orders of magnitude, is in fact an approximation of spectral clustering, for which we are able to control the error. We test the performance of our method on artificial and real-world network data.
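The three steps of plain spectral clustering named in this abstract can be sketched as follows. This is a minimal illustration of the baseline pipeline only, not the compressive method the authors propose. Everything specific here is an assumption made for the example: the two-cluster case (k = 2), a dense Gaussian-kernel similarity graph with bandwidth `sigma`, the unnormalized Laplacian L = D - W, power iteration to obtain the single eigenvector needed, and the function names themselves.

```python
import math
import random


def similarity_graph(points, sigma=1.0):
    """Step 1: dense Gaussian-kernel similarity matrix W with zero diagonal."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            W[i][j] = W[j][i] = math.exp(-d2 / (2 * sigma ** 2))
    return W


def fiedler_vector(W, iters=300, seed=0):
    """Step 2 (k = 2): eigenvector of L = D - W for the second-smallest
    eigenvalue, found by power iteration on (c*I - L) after projecting
    out the constant vector (the eigenvector for eigenvalue 0)."""
    n = len(W)
    deg = [sum(row) for row in W]
    c = 2 * max(deg)  # upper bound on the largest eigenvalue of L
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant component
        # u = (c*I - L) v = (c - deg_i) v_i + sum_j W_ij v_j
        u = [(c - deg[i]) * v[i] + sum(W[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in u))
        v = [x / norm for x in u]
    return v


def two_means_1d(values, iters=20):
    """Step 3 (k = 2): Lloyd's k-means on a one-dimensional feature."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(x - lo) <= abs(x - hi) else 1 for x in values]
        g0 = [x for x, lab in zip(values, labels) if lab == 0]
        g1 = [x for x, lab in zip(values, labels) if lab == 1]
        if g0:
            lo = sum(g0) / len(g0)
        if g1:
            hi = sum(g1) / len(g1)
    return labels


def spectral_bipartition(points, sigma=1.0):
    """Run the three steps in order and return one 0/1 label per point."""
    W = similarity_graph(points, sigma)
    feature = fiedler_vector(W)
    return two_means_1d(feature)
```

Calling `spectral_bipartition` on two well-separated groups of 2-D points should assign one label to each group. The nested loops make the quadratic cost of the dense graph and of each matrix-vector product visible, which is the kind of cost for large N that the abstract refers to.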

artificial intelligence, data mining, machine learning, (18 more...)

1602.02018

Country:

North America > United States (0.46)
Europe (0.46)
Asia > Middle East (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Completing Low-Rank Matrices with Corrupted Samples from Few Coefficients in General Basis

Zhang, Hongyang, Lin, Zhouchen, Zhang, Chao

arXiv.org Machine Learning, May-23-2016

Subspace recovery from corrupted and missing data is crucial for various applications in signal processing and information theory. To complete missing values and detect column corruptions, existing robust Matrix Completion (MC) methods mostly concentrate on recovering a low-rank matrix from few corrupted coefficients w.r.t. the standard basis, which, however, does not apply to more general bases, e.g., the Fourier basis. In this paper, we prove that the range space of an $m\times n$ matrix with rank $r$ can be exactly recovered from few coefficients w.r.t. a general basis, even when $r$ and the number of corrupted samples are both as high as $O(\min\{m,n\}/\log^3 (m+n))$. Our model covers previous ones as special cases, and robust MC can recover the intrinsic matrix with a higher rank. Moreover, we suggest a universal choice of the regularization parameter, which is $\lambda=1/\sqrt{\log n}$. By our $\ell_{2,1}$ filtering algorithm, which has theoretical guarantees, we can further reduce the computational cost of our model. As an application, we also find that the solutions to extended robust Low-Rank Representation and to our extended robust MC are mutually expressible, so both our theory and algorithm can be applied to the subspace clustering problem with missing values under certain conditions. Experiments verify our theories.

artificial intelligence, data mining, machine learning, (19 more...)

doi: 10.1109/TIT.2016.2573311

1506.07615

Country: Asia > China (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Nonstationary Distance Metric Learning

Greenewald, Kristjan, Kelley, Stephen, Hero, Alfred

arXiv.org Machine Learning, May-22-2016

Recent work in distance metric learning has focused on learning transformations of data that best align with provided sets of pairwise similarity and dissimilarity constraints. The learned transformations lead to improved retrieval, classification, and clustering algorithms due to the better adapted distance or similarity measures. Here, we introduce the problem of learning these transformations when the underlying constraint generation process is nonstationary. This nonstationarity can be due to changes in either the ground-truth clustering used to generate constraints or changes to the feature subspaces in which the class structure is apparent. We propose and evaluate COMID-SADL, an adaptive, online approach for learning and tracking optimal metrics as they change over time that is highly robust to a variety of nonstationary behaviors in the changing metric. We demonstrate COMID-SADL on both real and synthetic data sets and show significant performance improvements relative to previously proposed batch and online distance metric learning algorithms.

artificial intelligence, learner, machine learning, (12 more...)

1603.03678

Country: North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)

Genre: Research Report (0.50)

Industry: Government > Regional Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Choosing the appropriate Clustering Algorithm (Video)

@machinelearnbot, May-21-2016, 11:40:15 GMT

This is a short video that covers the criteria that I use when choosing the appropriate clustering algorithm. If you have other criteria that you use, please do let me know by leaving a comment on my blog or by reaching out to me on Twitter @VRaoRao. Thanks!

appropriate clustering algorithm, data mining, machine learning, (5 more...)

Technology:

Information Technology > Communications > Social Media (0.89)
Information Technology > Data Science > Data Mining (0.78)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.78)

Optimal Cluster Recovery in the Labeled Stochastic Block Model

Yun, Se-Young, Proutiere, Alexandre

arXiv.org Machine Learning, May-21-2016

We consider the problem of community detection or clustering in the labeled Stochastic Block Model (LSBM) with a finite number $K$ of clusters of sizes linearly growing with the global population of items $n$. Every pair of items is labeled independently at random, and label $\ell$ appears with probability $p(i,j,\ell)$ between two items in clusters indexed by $i$ and $j$, respectively. The objective is to reconstruct the clusters from the observation of these random labels. Clustering under the SBM and its extensions has attracted much attention recently. Most existing work has aimed at characterizing the set of parameters such that it is possible to infer clusters either positively correlated with the true clusters, or with a vanishing proportion of misclassified items, or exactly matching the true clusters. We find the set of parameters such that there exists a clustering algorithm with at most $s$ misclassified items on average under the general LSBM and for any $s=o(n)$, which solves one open problem raised in \cite{abbe2015community}. We further develop an algorithm, based on simple spectral methods, that achieves this fundamental performance limit within $O(n \mbox{polylog}(n))$ computations and without a priori knowledge of the model parameters.
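The generative model described in this abstract (each pair of items independently receives label $\ell$ with probability $p(i,j,\ell)$, where $i$ and $j$ are the clusters of the two items) can be sketched as below. This only samples observations from an LSBM; it does not attempt the recovery algorithm the authors develop. The function name, the nested-list layout of `p`, and the use of integer labels `0, 1, ...` are assumptions made for illustration.

```python
import random


def sample_lsbm(cluster_of, p, seed=0):
    """Draw one label for every unordered pair of items.

    cluster_of[v] is the cluster index of item v.
    p[i][j] is a list of label probabilities for a pair whose items lie in
    clusters i and j; entry l plays the role of p(i, j, l) and each list
    is assumed to sum to 1.
    Returns a dict mapping (u, v) with u < v to the sampled label index.
    """
    rng = random.Random(seed)
    n = len(cluster_of)
    observed = {}
    for u in range(n):
        for v in range(u + 1, n):
            weights = p[cluster_of[u]][cluster_of[v]]
            # Sample a label index according to the pair's distribution.
            label = rng.choices(range(len(weights)), weights=weights)[0]
            observed[(u, v)] = label
    return observed
```

With two labels, where label 0 is read as "no edge" and label 1 as "edge", this reduces to the ordinary stochastic block model. A clustering algorithm would receive only the returned dictionary and try to reconstruct `cluster_of` up to a relabeling of the clusters.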

artificial intelligence, high probability, machine learning, (17 more...)

1510.05956

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)
