AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Adaptive Estimation in Structured Factor Models with Applications to Overlapping Clustering

Bing, Xin, Bunea, Florentina, Ning, Yang, Wegkamp, Marten

arXiv.org Machine Learning, Mar-21-2018

This work introduces a novel estimation method, called LOVE, of the entries and structure of a loading matrix $A$ in a sparse latent factor model $X = AZ + E$, for an observable random vector $X \in \mathbb{R}^p$, with correlated unobservable factors $Z \in \mathbb{R}^K$, with $K$ unknown, and independent noise $E$. Each row of $A$ is scaled and sparse. In order to identify the loading matrix $A$, we require the existence of pure variables, which are components of $X$ that are associated, via $A$, with one and only one latent factor. Despite the fact that the number of factors $K$, the number of the pure variables, and their location are all unknown, we only require a mild condition on the covariance matrix of $Z$, and a minimum of only two pure variables per latent factor to show that $A$ is uniquely defined, up to signed permutations. Our proofs for model identifiability are constructive, and lead to our novel estimation method of the number of factors and of the set of pure variables, from a sample of size $n$ of observations on $X$. This is the first step of our LOVE algorithm, which is optimization-free, and has low computational complexity of order $p^2$. The second step of LOVE is an easily implementable linear program that estimates $A$. We prove that the resulting estimator is minimax rate optimal up to logarithmic factors in $p$. The model structure is motivated by the problem of overlapping variable clustering, ubiquitous in data science. We define the population level clusters as groups of those components of $X$ that are associated, via the sparse matrix $A$, with the same unobservable latent factor, and multi-factor association is allowed. Clusters are respectively anchored by the pure variables, and form overlapping sub-groups of the $p$-dimensional random vector $X$. The Latent model approach to OVErlapping clustering is reflected in the name of our algorithm, LOVE.
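To make the idea of pure variables concrete, here is a minimal population-level sketch in Python. It is not the LOVE algorithm: it works on an exact covariance matrix rather than a sample, and the detection rule (a variable is kept when every variable attaining its largest absolute off-diagonal covariance has the same row maximum) is an assumption chosen for illustration, in the spirit of the constructive identifiability argument the abstract describes. The loading matrix `A`, the factor covariance `C`, and the function names are all hypothetical, and the noise is assumed to have a diagonal covariance so that it does not affect off-diagonal entries.

```python
def covariance_from_model(A, C):
    """Return A C A^T; diagonal noise would only change the diagonal entries."""
    p, K = len(A), len(C)
    AC = [[sum(A[i][k] * C[k][l] for k in range(K)) for l in range(K)]
          for i in range(p)]
    return [[sum(AC[i][l] * A[j][l] for l in range(K)) for j in range(p)]
            for i in range(p)]


def find_pure_variables(sigma, tol=1e-9):
    """Flag variables whose strongest partners share the same row maximum."""
    p = len(sigma)
    # M[i]: largest absolute covariance between variable i and any other variable.
    M = [max(abs(sigma[i][j]) for j in range(p) if j != i) for i in range(p)]
    pure = []
    for i in range(p):
        # S: the other variables that attain the maximum in row i.
        S = [j for j in range(p)
             if j != i and abs(abs(sigma[i][j]) - M[i]) <= tol]
        # Keep i only if every such partner has the same row maximum as i.
        if all(abs(M[j] - M[i]) <= tol for j in S):
            pure.append(i)
    return pure


# Hypothetical example: 6 variables, 2 factors.
# Rows 0-1 load only on factor 0, rows 2-3 only on factor 1 (with signs),
# rows 4-5 load on both factors, so they belong to overlapping clusters.
A = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0], [0.5, 0.5], [0.3, -0.6]]
C = [[2.0, 0.5], [0.5, 1.5]]  # covariance of the correlated factors Z

sigma = covariance_from_model(A, C)
print(find_pure_variables(sigma))  # [0, 1, 2, 3]
```

In this toy setup each factor has two pure variables, matching the minimum the abstract requires, and the mixed variables 4 and 5 are rejected because their strongest partners have larger row maxima than they do.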

artificial intelligence, machine learning, matrix, (18 more...)

1704.06977

Country: North America > United States (0.46)

Genre:

Research Report (0.64)
Workflow (0.48)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.66)

Machine Learning on DARWIN Datasets (MLD-I) Darwinex Blog

#artificialintelligence, Mar-20-2018, 11:13:46 GMT

Machine learning, in essence, is the research and application of algorithms that help us better understand data. By leveraging statistical learning techniques from the realm of machine learning, practitioners are able to draw meaningful inferences from data and turn it into actionable intelligence. Furthermore, the availability of several open source machine learning tools, platforms and libraries today enables anyone to break into this field, using a range of powerful algorithms to discover exploitable patterns in data and predict future outcomes. This development in particular has given rise to a new wave of DIY retail traders, creating sophisticated trading strategies that compete with (and in some cases outperform) others in a space previously dominated by institutional participants. In this introductory blog post, we will discuss the reasoning behind machine learning and its different categories.

artificial intelligence, learning, machine learning, (12 more...)

Industry: Banking & Finance > Trading (0.36)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.30)

Generating Redundant Features with Unsupervised Multi-Tree Genetic Programming

Lensen, Andrew, Xue, Bing, Zhang, Mengjie

arXiv.org Artificial Intelligence, Mar-20-2018

Recently, feature selection has become an increasingly important area of research due to the surge in high-dimensional datasets in all areas of modern life. A plethora of feature selection algorithms have been proposed, but it is difficult to truly analyse the quality of a given algorithm. Ideally, an algorithm would be evaluated by measuring how well it removes known bad features. Acquiring datasets with such features is inherently difficult, and so a common technique is to add synthetic bad features to an existing dataset. While adding noisy features is an easy task, it is very difficult to automatically add complex, redundant features. This work proposes one of the first approaches to generating redundant features, using a novel genetic programming approach. Initial experiments show that our proposed method can automatically create difficult, redundant features which have the potential to be used for creating high-quality feature selection benchmark datasets.

Keywords: Genetic Programming, Feature Creation, Feature Construction, Feature Selection, Mutual Information, Evolutionary Computation

1 Introduction

Feature Selection (FS) techniques aim to remove features from a dataset which are less useful than others.

artificial intelligence, evolutionary algorithm, machine learning, (16 more...)

doi: 10.1007/978-3-319-77553-1_6

1802.00554

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Recursive nearest agglomeration (ReNA): fast clustering for approximation of structured signals

Hoyos-Idrobo, Andrés, Varoquaux, Gaël, Kahn, Jonas, Thirion, Bertrand

arXiv.org Machine Learning, Mar-18-2018, 19:00:00 GMT

In this work, we revisit fast dimension reduction approaches, such as random projections and random sampling. Our goal is to summarize the data to decrease the computational costs and memory footprint of subsequent analysis. Such dimension reduction can be very efficient when the signals of interest have a strong structure, such as with images. We focus on this setting and investigate feature clustering schemes for data reductions that capture this structure. An impediment to fast dimension reduction is that good clustering comes with large algorithmic costs. We address it by contributing a linear-time agglomerative clustering scheme, Recursive Nearest Agglomeration (ReNA). Unlike existing fast agglomerative schemes, it avoids the creation of giant clusters. We empirically validate that it approximates the data as well as traditional variance-minimizing clustering schemes that have a quadratic complexity. In addition, we analyze signal approximation with feature clustering and show that it can remove noise, improving subsequent analysis steps. As a consequence, data reduction by clustering features with ReNA yields very fast and accurate models, making it possible to process large datasets on a budget. Our theoretical analysis is backed by extensive experiments on publicly available data that illustrate the computational efficiency and the denoising properties of the resulting dimension reduction scheme.
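The flavor of nearest-neighbor agglomeration can be sketched in a few lines of Python. This is a simplified illustration and not the authors' ReNA implementation: it assumes features laid out on a 1-D chain (so each feature has at most two structural neighbors), performs a single round, and uses hypothetical names throughout. Each feature links to the adjacent feature with the closest value, and the connected components of those links become clusters that are summarized by their mean.

```python
def nearest_neighbor_round(values):
    """One agglomeration round on a 1-D chain of feature values (needs >= 2 values)."""
    n = len(values)
    parent = list(range(n))  # union-find forest over feature indices

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Structural neighbors on the chain: left and right, when they exist.
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n]
        # Link feature i to the neighbor whose value is closest.
        nearest = min(neighbors, key=lambda j: abs(values[i] - values[j]))
        parent[find(i)] = find(nearest)

    # Connected components of the nearest-neighbor links are the clusters.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    clusters = sorted(groups.values())
    means = [sum(values[i] for i in c) / len(c) for c in clusters]
    return clusters, means


values = [1.0, 1.1, 5.0, 5.2, 5.1, 9.0]
clusters, means = nearest_neighbor_round(values)
print(clusters)  # [[0, 1], [2, 3, 4, 5]]
```

Because every feature links to at least one neighbor, each cluster in this sketch contains at least two features, so one round at least halves the number of features. Repeating the round on the cluster means until a target number of clusters is reached is one plausible way to extend the sketch.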

approximation, artificial intelligence, machine learning, (14 more...)

doi: 10.1109/TPAMI.2018.2815524

1609.04608

Country:

Europe > France (0.28)
North America > United States (0.28)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.68)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.93)
Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Rare Feature Selection in High Dimensions

Yan, Xiaohan, Bien, Jacob

arXiv.org Machine Learning, Mar-18-2018

It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers.
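The aggregation idea can be illustrated with a small Python sketch. It is not the estimator from the paper or the `rare` R package: there, which features to merge is learned from the data, whereas here the merge is fixed by hand. The word counts, the `parent_of` mapping (one level of a similarity tree), and the function name are all hypothetical.

```python
def aggregate_by_parent(counts, parent_of):
    """Sum sparse leaf-feature counts into their parent node in a similarity tree."""
    aggregated = {}
    for feature, count in counts.items():
        # Features without a listed parent are kept as their own group.
        group = parent_of.get(feature, feature)
        aggregated[group] = aggregated.get(group, 0) + count
    return aggregated


# Hypothetical one-level tree: rare adjectives grouped by similar meaning.
parent_of = {
    "superb": "POSITIVE",
    "stellar": "POSITIVE",
    "exquisite": "POSITIVE",
    "dingy": "NEGATIVE",
    "grimy": "NEGATIVE",
}

# Word counts for a single review; each rare word appears at most once.
review_counts = {"superb": 1, "exquisite": 1, "grimy": 1, "hotel": 2}

dense = aggregate_by_parent(review_counts, parent_of)
print(dense)  # {'POSITIVE': 2, 'NEGATIVE': 1, 'hotel': 2}
```

Across many reviews, a column such as `POSITIVE` is nonzero far more often than any single rare word, which is what makes the aggregated feature denser.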

data mining, machine learning, natural language, (20 more...)

1803.06675

Country:

Europe (0.45)
North America > United States > California (0.27)

Genre: Research Report (0.63)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Estimation of lactate threshold with machine learning techniques in recreational runners

Etxegarai, Urtats, Portillo, Eva, Irazusta, Jon, Arriandiaga, Ander, Cabanes, Itziar

arXiv.org Machine Learning, Mar-18-2018

Lactate threshold is considered an essential parameter when assessing the performance of elite and recreational runners and prescribing training intensities in endurance sports. However, the measurement of blood lactate concentration requires expensive equipment and the extraction of blood samples, which are inconvenient for frequent monitoring. Furthermore, most recreational runners do not have access to routine assessment of their physical fitness with such equipment, so they cannot determine their lactate threshold without resorting to an expensive and specialized centre. Therefore, the main objective of this study is to create an intelligent system capable of estimating the lactate threshold of recreational athletes participating in endurance running sports. The solution proposed here is based on a machine learning system which models the lactate evolution using recurrent neural networks and includes a proposed standardization of the temporal axis as well as a modification of the stratified sampling method. The results show that the proposed system accurately estimates the lactate threshold of 89.52% of the athletes and its correlation with the experimentally measured lactate threshold is very high (R = 0.89). Moreover, its behaviour with the test dataset is as good as with the training set, meaning that the generalization power of the model is high. Therefore, in this study a machine learning based system is proposed as an alternative to the traditional invasive lactate threshold measurement tests for recreational runners.

artificial intelligence, lactate threshold, machine learning, (17 more...)

doi: 10.1016/j.asoc.2017.11.036

1803.0603

Country:

Europe (1.00)
North America > United States (0.93)

Genre: Research Report > New Finding (0.54)

Industry:

Leisure & Entertainment > Sports > Running (1.00)
Health & Medicine > Consumer Health (1.00)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Provable Estimation of the Number of Blocks in Block Models

Yan, Bowei, Sarkar, Purnamrita, Cheng, Xiuyuan

arXiv.org Machine Learning, Mar-17-2018, 19:00:00 GMT

Community detection is a fundamental unsupervised learning problem for unlabeled networks which has a broad range of applications. Many community detection algorithms assume that the number of clusters $r$ is known a priori. In this paper, we propose an approach based on semi-definite relaxations which, unlike many existing convex relaxation methods, does not require prior knowledge of model parameters, and which recovers the number of clusters and the clustering matrix exactly under a broad parameter regime, with probability tending to one. On a variety of simulated and real data experiments, we show that the proposed method often outperforms state-of-the-art techniques for estimating the number of clusters.

artificial intelligence, machine learning, matrix, (15 more...)

1705.0858

Country: North America > United States (0.68)

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Provable Convex Co-clustering of Tensors

Chi, Eric C., Gaines, Brian R., Sun, Will Wei, Zhou, Hua, Yang, Jian

arXiv.org Machine Learning, Mar-17-2018

Cluster analysis is a fundamental tool for pattern discovery in complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees and is both computationally and storage efficient. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising "blessing of dimensionality" phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulation studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.

artificial intelligence, machine learning, tensor, (14 more...)

1803.06518

Country: North America > United States > California > Los Angeles County > Los Angeles (0.27)

Genre: Research Report (1.00)

Industry:

Marketing (0.69)
Information Technology (0.66)
Government > Regional Government > North America Government (0.46)
Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Hidden Integrality of SDP Relaxation for Sub-Gaussian Mixture Models

Fei, Yingjie, Chen, Yudong

arXiv.org Machine Learning, Mar-17-2018

We consider the problem of estimating the discrete clustering structures under Sub-Gaussian Mixture Models. Our main results establish a hidden integrality property of a semidefinite programming (SDP) relaxation for this problem: while the optimal solutions to the SDP are not integer-valued in general, their estimation errors can be upper bounded in terms of the error of an idealized integer program. The error of the integer program, and hence that of the SDP, is further shown to decay exponentially in the signal-to-noise ratio. To the best of our knowledge, this is the first exponentially decaying error bound for convex relaxations of mixture models, and our results reveal the "global-to-local" mechanism that drives the performance of the SDP relaxation. A corollary of our results shows that in certain regimes the SDP solutions are in fact integral and exact, improving on existing exact recovery results for convex relaxations. More generally, our results establish sufficient conditions for the SDP to correctly recover the cluster memberships of a $(1-\delta)$ fraction of the points for any $\delta\in(0,1)$. As a special case, we show that under the $d$-dimensional Stochastic Ball Model, the SDP achieves non-trivial (sometimes exact) recovery when the center separation is as small as $\sqrt{1/d}$, which complements previous exact recovery results that require constant separation.

artificial intelligence, machine learning, relaxation, (17 more...)

1803.0651

Genre: Research Report > New Finding (0.95)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Data Science (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)

S-Isomap++: Multi Manifold Learning from Streaming Data

Mahapatra, Suchismit, Chandola, Varun

arXiv.org Machine Learning, Mar-17-2018

Manifold learning based methods have been widely used for non-linear dimensionality reduction (NLDR). However, in many practical settings, the need to process streaming data is a challenge for such methods, owing to the high computational complexity involved. Moreover, most methods operate under the assumption that the input data is sampled from a single manifold, embedded in a high dimensional space. We propose a method for streaming NLDR when the observed data is either sampled from multiple manifolds or irregularly sampled from a single manifold. We show that existing NLDR methods, such as Isomap, fail in such situations, primarily because they rely on smoothness and continuity of the underlying manifold, assumptions that are violated in the scenarios explored in this paper. In contrast, the proposed algorithm is able to learn effectively in the presence of multiple, and potentially intersecting, manifolds, while allowing the input data to arrive as a massive stream.

data mining, machine learning, manifold, (18 more...)

doi: 10.1109/BigData.2017.8257987

1710.06462

Genre: Research Report (0.82)

Industry: Education (0.71)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)
