Clustering
What are Gaussian Mixture Models? A Powerful Clustering Algorithm
They offer a completely different challenge to a supervised learning problem – there's much more room for experimenting with the data that I have. It's no wonder that the majority of developments and breakthroughs in the machine learning space are happening in the unsupervised learning domain. And one of the most popular techniques in unsupervised learning is clustering. It's a concept we typically learn early on in our machine learning journey and it's simple enough to grasp. I'm sure you've come across or even worked on projects like customer segmentation, market basket analysis, etc.
A Simple and Efficient Method to Compute a Single Linkage Dendrogram
Zhu, Huanbiao, Stuetzle, Werner
We address the problem of computing a single linkage dendrogram. A possible approach is to: (i) Form an edge weighted graph $G$ over the data, with edge weights reflecting dissimilarities. (ii) Calculate the MST $T$ of $G$. (iii) Break the longest edge of $T$ thereby splitting it into subtrees $T_L$, $T_R$. (iv) Apply the splitting process recursively to the subtrees. This approach has the attractive feature that Prim's algorithm for MST construction calculates distances as needed, and hence there is no need to ever store the inter-point distance matrix. The recursive partitioning algorithm requires us to determine the vertices (and edges) of $T_L$ and $T_R$. We show how this can be done easily and efficiently using information generated by Prim's algorithm without any additional computational cost.
Research and application of time series algorithms in centralized purchasing data
Bai, Yun, Jia, Suling, Li, Xixi
Based on the online transaction data of COSCO group's centralized procurement platform, this paper studies the clustering method of time series type data. The different methods of similarity calculation, different clustering methods with different K values are analysed, and the best clustering method suitable for centralized purchasing data is determined. The company list under the corresponding cluster is obtained. The time series motif discovery algorithm is used to model the centroid of each cluster. Through ARIMA method, we also made 12 periods of prediction for the centroid of each category. This paper constructs a matrix of "Customer Lifecycle Theory - Five Elements of Marketing ", and puts forward corresponding marketing suggestions for customers at different life cycle stages.
Integrated Clustering and Anomaly Detection (INCAD) for Streaming Data (Revised)
Guggilam, Sreelekha, Zaidi, Syed M. A., Chandola, Varun, Patra, Abani K.
Most current clustering based anomaly detection methods use scoring schema and thresholds to classify anomalies. These methods are often tailored to target specific data sets with "known" number of clusters. The paper provides a streaming clustering and anomaly detection algorithm that does not require strict arbitrary thresholds on the anomaly scores or knowledge of the number of clusters while performing probabilistic anomaly detection and clustering simultaneously. This ensures that the cluster formation is not impacted by the presence of anomalous data, thereby leading to more reliable definition of "normal vs abnormal" behavior. The motivations behind developing the INCAD model and the path that leads to the streaming model is discussed.
Regularized Non-negative Spectral Embedding for Clustering
Wang, Yifei, Liu, Rui, Chen, Yong, Zhangs, Hui, Ye, Zhiwen
--Spectral Clustering is a popular technique to split data points into groups, especially for complex datasets. The algorithms in the Spectral Clustering family typically consist of multiple separate stages (such as similarity matrix construction, low-dimensional embedding, and K-Means clustering as post-processing), which may lead to sub-optimal results because of the possible mismatch between different stages. In this paper, we propose an end-to-end single-stage learning method to clustering called Regularized Nonnegative Spectral Embedding (RNSE) which extends Spectral Clustering with the adaptive learning of similarity matrix and meanwhile utilizes nonnegative constraints to facilitate one-step clustering (directly from data points to clustering labels). Two well-founded methods, successive alternating projection and strategic multiplicative update, are employed to work out the quite challenging optimization problems in RNSE. Extensive experiments on both synthetic and real-world datasets demonstrate RNSE's superior clustering performance to some state-of-the-art competitors. I NTRODUCTION Clustering is an important unsupervised learning task which aims to group a set of data objects into clusters in such a way that objects in the same cluster are more similar to each other than those in different clusters. For complex datasets, Spectral Clustering [1] and its many variants [2]- [4] are particularly popular due to their ability of discovering highly non-convex clusters.
Support vector clustering - Scholarpedia
The objective of clustering is to partition a data set into groups according to some criterion in an attempt to organize data into a more meaningful form. There are many ways of achieving this goal. Clustering may proceed according to some parametric model or by grouping points according to some distance or similarity measure as in hierarchical clustering. A natural way to put cluster boundaries is in regions in data space where there is little data, i.e. in "valleys" in the probability distribution of the data. This is the path taken in support vector clustering (SVC), which is based on the support vector approach (see Ben-Hur et al., 2001).
Meta-Learning to Cluster
Clustering is one of the most fundamental and wide-spread techniques in exploratory data analysis. Yet, the basic approach to clustering has not really changed: a practitioner hand-picks a task-specific clustering loss to optimize and fit the given data to reveal the underlying cluster structure. Some types of losses---such as k-means, or its non-linear version: kernelized k-means (centroid based), and DBSCAN (density based)---are popular choices due to their good empirical performance on a range of applications. Although every so often the clustering output using these standard losses fails to reveal the underlying structure, and the practitioner has to custom-design their own variation. In this work we take an intrinsically different approach to clustering: rather than fitting a dataset to a specific clustering loss, we train a recurrent model that learns how to cluster. The model uses as training pairs examples of datasets (as input) and its corresponding cluster identities (as output). By providing multiple types of training datasets as inputs, our model has the ability to generalize well on unseen datasets (new clustering tasks). Our experiments reveal that by training on simple synthetically generated datasets or on existing real datasets, we can achieve better clustering performance on unseen real-world datasets when compared with standard benchmark clustering techniques. Our meta clustering model works well even for small datasets where the usual deep learning models tend to perform worse.
Learning Deterministic Weighted Automata with Queries and Counterexamples
Weiss, Gail, Goldberg, Yoav, Yahav, Eran
We present an algorithm for extraction of a probabilistic deterministic finite automaton (PDFA) from a given black-box language model, such as a recurrent neural network (RNN). The algorithm is a variant of the exact-learning algorithm L*, adapted to a probabilistic setting with noise. The key insight is the use of conditional probabilities for observations, and the introduction of a local tolerance when comparing them. When applied to RNNs, our algorithm often achieves better word error rate (WER) and normalised distributed cumulative gain (NDCG) than that achieved by spectral extraction of weighted finite automata (WFA) from the same networks. PDFAs are substantially more expressive than n-grams, and are guaranteed to be stochastic and deterministic - unlike spectrally extracted WFAs.
Machine Learning, Part 3: Unsupervised Learning
Clustering is sometimes called "unsupervised classification", a term that I have mixed feelings on for reasons I will cover shortly, but it provides a good enough explanation of the problem to be worth covering. First, the problem is unsupervised -- we won't have a labeled dataset to guide our logic. Secondly we are looking to separate items into classes based on the predictors (technically they are not predictors they are "features" here because there is no response). The difference is that in supervised classification the class structure is known and labeled, whereas in clustering we are inventing the class structure from the feature values alone. In supervised classification we used the labels to single out one class and looked for predictors that had two qualities: 1) They had fairly common values for every example of that class and 2) they separated that class from others.
Trend-responsive User Segmentation Enabling Traceable Publishing Insights. A Case Study of a Real-world Large-scale News Recommendation System
Misztal-Radecka, Joanna, Rusiecki, Dominik, Żmuda, Michał, Bujak, Artur
The traditional offline approaches are no longer sufficient for building modern recommender systems in domains such as online news services, mainly due to the high dynamics of environment changes and necessity to operate on a large scale with high data sparsity. The ability to balance exploration with exploitation makes the multi-armed bandits an efficient alternative to the conventional methods, and a robust user segmentation plays a crucial role in providing the context for such online recommendation algorithms. In this work, we present an unsupervised and trend-responsive method for segmenting users according to their semantic interests, which has been integrated with a real-world system for large-scale news recommendations. The results of an online A/B test show significant improvements compared to a global-optimization algorithm on several services with different characteristics. Based on the experimental results as well as the exploration of segments descriptions and trend dynamics, we propose extensions to this approach that address particular real-world challenges for different use-cases. Moreover, we describe a method of generating traceable publishing insights facilitating the creation of content that serves the diversity of all users needs.