AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

What are Gaussian Mixture Models? A Powerful Clustering Algorithm

#artificialintelligenceNov-1-2019, 04:52:51 GMT

They offer a completely different challenge to a supervised learning problem – there's much more room for experimenting with the data that I have. It's no wonder that the majority of developments and breakthroughs in the machine learning space are happening in the unsupervised learning domain. And one of the most popular techniques in unsupervised learning is clustering. It's a concept we typically learn early on in our machine learning journey and it's simple enough to grasp. I'm sure you've come across or even worked on projects like customer segmentation, market basket analysis, etc.

algorithm, gaussian distribution, gaussian mixture model, (12 more...)

#artificialintelligence

Industry: Education (0.55)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

A Simple and Efficient Method to Compute a Single Linkage Dendrogram

Zhu, Huanbiao, Stuetzle, Werner

arXiv.org Machine LearningNov-1-2019

We address the problem of computing a single linkage dendrogram. A possible approach is to: (i) Form an edge weighted graph $G$ over the data, with edge weights reflecting dissimilarities. (ii) Calculate the MST $T$ of $G$. (iii) Break the longest edge of $T$ thereby splitting it into subtrees $T_L$, $T_R$. (iv) Apply the splitting process recursively to the subtrees. This approach has the attractive feature that Prim's algorithm for MST construction calculates distances as needed, and hence there is no need to ever store the inter-point distance matrix. The recursive partitioning algorithm requires us to determine the vertices (and edges) of $T_L$ and $T_R$. We show how this can be done easily and efficiently using information generated by Prim's algorithm without any additional computational cost.

algorithm, prim, vertex, (12 more...)

arXiv.org Machine Learning

1911.00223

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Research and application of time series algorithms in centralized purchasing data

Bai, Yun, Jia, Suling, Li, Xixi

arXiv.org Artificial IntelligenceNov-1-2019

Based on the online transaction data of COSCO group's centralized procurement platform, this paper studies the clustering method of time series type data. The different methods of similarity calculation, different clustering methods with different K values are analysed, and the best clustering method suitable for centralized purchasing data is determined. The company list under the corresponding cluster is obtained. The time series motif discovery algorithm is used to model the centroid of each cluster. Through ARIMA method, we also made 12 periods of prediction for the centroid of each category. This paper constructs a matrix of "Customer Lifecycle Theory - Five Elements of Marketing ", and puts forward corresponding marketing suggestions for customers at different life cycle stages.

algorithm, international conference, time sery, (14 more...)

arXiv.org Artificial Intelligence

1911.00449

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.26)
South America > Brazil > São Paulo (0.04)
Europe > Spain > Galicia > Madrid (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.70)

Industry: Government (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Integrated Clustering and Anomaly Detection (INCAD) for Streaming Data (Revised)

Guggilam, Sreelekha, Zaidi, Syed M. A., Chandola, Varun, Patra, Abani K.

arXiv.org Machine LearningOct-31-2019

Most current clustering based anomaly detection methods use scoring schema and thresholds to classify anomalies. These methods are often tailored to target specific data sets with "known" number of clusters. The paper provides a streaming clustering and anomaly detection algorithm that does not require strict arbitrary thresholds on the anomaly scores or knowledge of the number of clusters while performing probabilistic anomaly detection and clustering simultaneously. This ensures that the cluster formation is not impacted by the presence of anomalous data, thereby leading to more reliable definition of "normal vs abnormal" behavior. The motivations behind developing the INCAD model and the path that leads to the streaming model is discussed.

anomaly detection, detection, probability, (11 more...)

arXiv.org Machine Learning

doi: 10.1007/978-3-030-22747-0_4

1911.00184

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > New York (0.04)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.56)

Add feedback

Regularized Non-negative Spectral Embedding for Clustering

Wang, Yifei, Liu, Rui, Chen, Yong, Zhangs, Hui, Ye, Zhiwen

arXiv.org Machine LearningOct-31-2019

--Spectral Clustering is a popular technique to split data points into groups, especially for complex datasets. The algorithms in the Spectral Clustering family typically consist of multiple separate stages (such as similarity matrix construction, low-dimensional embedding, and K-Means clustering as post-processing), which may lead to sub-optimal results because of the possible mismatch between different stages. In this paper, we propose an end-to-end single-stage learning method to clustering called Regularized Nonnegative Spectral Embedding (RNSE) which extends Spectral Clustering with the adaptive learning of similarity matrix and meanwhile utilizes nonnegative constraints to facilitate one-step clustering (directly from data points to clustering labels). Two well-founded methods, successive alternating projection and strategic multiplicative update, are employed to work out the quite challenging optimization problems in RNSE. Extensive experiments on both synthetic and real-world datasets demonstrate RNSE's superior clustering performance to some state-of-the-art competitors. I NTRODUCTION Clustering is an important unsupervised learning task which aims to group a set of data objects into clusters in such a way that objects in the same cluster are more similar to each other than those in different clusters. For complex datasets, Spectral Clustering [1] and its many variants [2]- [4] are particularly popular due to their ability of discovering highly non-convex clusters.

dataset, matrix, optimization problem, (12 more...)

arXiv.org Machine Learning

1911.00179

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)
Oceania > Australia (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Support vector clustering - Scholarpedia

#artificialintelligenceOct-30-2019, 03:58:43 GMT

The objective of clustering is to partition a data set into groups according to some criterion in an attempt to organize data into a more meaningful form. There are many ways of achieving this goal. Clustering may proceed according to some parametric model or by grouping points according to some distance or similarity measure as in hierarchical clustering. A natural way to put cluster boundaries is in regions in data space where there is little data, i.e. in "valleys" in the probability distribution of the data. This is the path taken in support vector clustering (SVC), which is based on the support vector approach (see Ben-Hur et al., 2001).

boundary, contour, mathbf, (16 more...)

#artificialintelligence

Country: North America > United States > Colorado > Larimer County > Fort Collins (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

Add feedback

Meta-Learning to Cluster

Jiang, Yibo, Verma, Nakul

arXiv.org Machine LearningOct-30-2019

Clustering is one of the most fundamental and wide-spread techniques in exploratory data analysis. Yet, the basic approach to clustering has not really changed: a practitioner hand-picks a task-specific clustering loss to optimize and fit the given data to reveal the underlying cluster structure. Some types of losses---such as k-means, or its non-linear version: kernelized k-means (centroid based), and DBSCAN (density based)---are popular choices due to their good empirical performance on a range of applications. Although every so often the clustering output using these standard losses fails to reveal the underlying structure, and the practitioner has to custom-design their own variation. In this work we take an intrinsically different approach to clustering: rather than fitting a dataset to a specific clustering loss, we train a recurrent model that learns how to cluster. The model uses as training pairs examples of datasets (as input) and its corresponding cluster identities (as output). By providing multiple types of training datasets as inputs, our model has the ability to generalize well on unseen datasets (new clustering tasks). Our experiments reveal that by training on simple synthetically generated datasets or on existing real datasets, we can achieve better clustering performance on unseen real-world datasets when compared with standard benchmark clustering techniques. Our meta clustering model works well even for small datasets where the usual deep learning models tend to perform worse.

algorithm, dataset, synthetic dataset, (15 more...)

arXiv.org Machine Learning

1910.14134

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New York > Richmond County > New York City (0.04)
North America > United States > New York > Queens County > New York City (0.04)
(7 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Learning Deterministic Weighted Automata with Queries and Counterexamples

Weiss, Gail, Goldberg, Yoav, Yahav, Eran

arXiv.org Machine LearningOct-30-2019

We present an algorithm for extraction of a probabilistic deterministic finite automaton (PDFA) from a given black-box language model, such as a recurrent neural network (RNN). The algorithm is a variant of the exact-learning algorithm L*, adapted to a probabilistic setting with noise. The key insight is the use of conditional probabilities for observations, and the introduction of a local tolerance when comparing them. When applied to RNNs, our algorithm often achieves better word error rate (WER) and normalised distributed cumulative gain (NDCG) than that achieved by spectral extraction of weighted finite automata (WFA) from the same networks. PDFAs are substantially more expressive than n-grams, and are guaranteed to be stochastic and deterministic - unlike spectrally extracted WFAs.

algorithm, pdfa, rnn, (16 more...)

arXiv.org Machine Learning

1910.13895

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
North America > United States > District of Columbia > Washington (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
(6 more...)

Genre: Research Report (0.64)

Industry: Transportation (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Machine Learning, Part 3: Unsupervised Learning

#artificialintelligenceOct-28-2019, 23:38:31 GMT

Clustering is sometimes called "unsupervised classification", a term that I have mixed feelings on for reasons I will cover shortly, but it provides a good enough explanation of the problem to be worth covering. First, the problem is unsupervised -- we won't have a labeled dataset to guide our logic. Secondly we are looking to separate items into classes based on the predictors (technically they are not predictors they are "features" here because there is no response). The difference is that in supervised classification the class structure is known and labeled, whereas in clustering we are inventing the class structure from the feature values alone. In supervised classification we used the labels to single out one class and looked for predictors that had two qualities: 1) They had fairly common values for every example of that class and 2) they separated that class from others.

class structure, classification, supervised classification, (16 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.33)

Add feedback

Trend-responsive User Segmentation Enabling Traceable Publishing Insights. A Case Study of a Real-world Large-scale News Recommendation System

Misztal-Radecka, Joanna, Rusiecki, Dominik, Żmuda, Michał, Bujak, Artur

arXiv.org Machine LearningOct-28-2019

The traditional offline approaches are no longer sufficient for building modern recommender systems in domains such as online news services, mainly due to the high dynamics of environment changes and necessity to operate on a large scale with high data sparsity. The ability to balance exploration with exploitation makes the multi-armed bandits an efficient alternative to the conventional methods, and a robust user segmentation plays a crucial role in providing the context for such online recommendation algorithms. In this work, we present an unsupervised and trend-responsive method for segmenting users according to their semantic interests, which has been integrated with a real-world system for large-scale news recommendations. The results of an online A/B test show significant improvements compared to a global-optimization algorithm on several services with different characteristics. Based on the experimental results as well as the exploration of segments descriptions and trend dynamics, we propose extensions to this approach that address particular real-world challenges for different use-cases. Moreover, we describe a method of generating traceable publishing insights facilitating the creation of content that serves the diversity of all users needs.

algorithm, recommendation, segmentation, (15 more...)

arXiv.org Machine Learning

1911.1107

Country:

North America > United States > New York > New York County > New York City (0.05)
Europe > Denmark > Capital Region > Copenhagen (0.05)
Asia > Middle East > Jordan (0.04)
(7 more...)

Genre: Research Report (1.00)

Industry:

Media > News (0.69)
Leisure & Entertainment > Sports > Soccer (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback