 Clustering


Beginners Guide to the Three Types of Machine Learning - KDnuggets

#artificialintelligence

Machine learning problems can generally be divided into three types: classification and regression, which are known as supervised learning, and unsupervised learning, which in the context of machine learning applications often refers to clustering. In the following article, I give a brief introduction to each of these three problems and include a walkthrough in the popular Python library scikit-learn. Before I start, I'll briefly explain the meaning of the terms supervised and unsupervised learning. Supervised Learning: In supervised learning, you have a known set of inputs (features) and a known set of outputs (labels).


Introduction Hierarchical Clustering

#artificialintelligence

Clustering tries to find structure in data by creating groupings of data with similar characteristics. The most famous clustering algorithm is likely K-means, but there are many other ways to cluster observations. Hierarchical clustering is an alternative class of clustering algorithms that produce 1 to n clusters, where n is the number of observations in the data set. As you go down the hierarchy from 1 cluster (which contains all the data) to n clusters (where each observation is its own cluster), the observations within each cluster become more and more similar (almost always). There are two types of hierarchical clustering: divisive (top-down) and agglomerative (bottom-up).


Coarse-Refinement Dilemma: On Generalization Bounds for Data Clustering

arXiv.org Machine Learning

This paper is organized as follows: Section 2 briefly introduces some studies related to the formalization of theoretical frameworks in the context of the Data Clustering (DC) problem; Section 3 introduces a general formulation for the DC and HC problems; Section 4 discusses the Coarse-Refinement Dilemma considering the homology group $H_0$; Section 5 shows that homology groups of degree greater than zero are affected by over-refined and over-coarsed topologies; Section 6 compares our proposed generalization bounds to Carlsson and Mémoli's [12] consistency; finally, conclusions and future directions are provided in Section 8.

2. Related work

Data Clustering (DC) faces many challenges in defining and guaranteeing generalization from datasets, as it does not rely on labels and, consequently, it cannot take advantage of computing any evident error measurement such as risk [7]. While studying this issue, Kleinberg [8] considered that a clustering model is an application of a mapping $f$ on top of a distance function $d: I \times I \to \mathbb{R}$, where $I$ contains the indices of data points in some fixed-size set $S$, disregarding its ambient space, though [25]. From this initial setup, Kleinberg [8] defined three properties to be respected in order to assess clustering algorithms and models:

- Scale-invariance: Given a distance function $d$, a clustering function $f$, and a scalar $\alpha > 0$, $f(d) = f(\alpha d)$ must hold. Thus, the similarity representation over $S$ must be consistent with the units of measurement;
- Consistency: Let $\Gamma$ be a partition of $S$ and $d, d'$ two distance functions. Function $d'$ is referred to as a $\Gamma$-transformation of $d$ if: (i) for all $i, j \in S$ belonging to the same cluster, $d'(i,j) \leq d(i,j)$; and (ii) for all $i, j \in S$ belonging to different clusters, $d'(i,j) \geq d(i,j)$. Consistency holds if $f(d') = f(d)$ whenever $d'$ is a $\Gamma$-transformation of $d$.
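The two properties listed above can be checked numerically for a concrete clustering function. The sketch below is illustrative only: it assumes single-linkage clustering stopped at k clusters as the function f (a choice made here, not taken from the paper), uses random points, and builds a Γ-transformation by halving within-cluster distances and doubling between-cluster distances. The helper names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def single_linkage_partition(dist_matrix, k):
    """Clustering function f: map a distance matrix to a partition of the indices."""
    condensed = squareform(dist_matrix, checks=False)
    labels = fcluster(linkage(condensed, method="single"), t=k, criterion="maxclust")
    # Represent the partition as a set of index sets so label names do not matter.
    return {frozenset(np.flatnonzero(labels == c).tolist()) for c in np.unique(labels)}


def gamma_transform(dist_matrix, partition, shrink=0.5, expand=2.0):
    """Shrink distances inside each cluster and expand distances between clusters."""
    transformed = dist_matrix * expand
    for block in partition:
        idx = np.array(sorted(block))
        transformed[np.ix_(idx, idx)] = dist_matrix[np.ix_(idx, idx)] * shrink
    return transformed


rng = np.random.default_rng(0)
d = squareform(pdist(rng.normal(size=(30, 2))))  # distance function d on 30 points
k = 3
partition = single_linkage_partition(d, k)

# Scale-invariance: f(d) == f(alpha * d) for alpha > 0.
scale_ok = single_linkage_partition(4.0 * d, k) == partition
# Consistency: f(d') == f(d) when d' is a Gamma-transformation of d.
consistency_ok = single_linkage_partition(gamma_transform(d, partition), k) == partition
print(scale_ok, consistency_ok)
```

A passing check on one random data set is evidence, not proof; it only shows what the two definitions ask of f in practice.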


Clustering by Directly Disentangling Latent Space

arXiv.org Machine Learning

To overcome the high dimensionality of data, learning latent feature representations for clustering has been widely studied recently. However, it is still challenging to learn "cluster-friendly" latent representations due to the unsupervised fashion of clustering. In this paper, we propose Disentangling Latent Space Clustering (DLS-Clustering), a new clustering mechanism that directly learns cluster assignments during the disentanglement of the latent space, without constructing a "cluster-friendly" latent representation or applying additional clustering methods. We achieve the bidirectional mapping by enforcing an inference network (i.e., an encoder) and the generator of a GAN to form a deterministic encoder-decoder pair with a maximum mean discrepancy (MMD)-based regularization. We utilize a weight-sharing procedure to disentangle the latent space into one-hot discrete latent variables and continuous latent variables. The disentangling process itself performs the clustering operation. Eventually, the one-hot discrete latent variables can be directly expressed as clusters, and the continuous latent variables represent the remaining unspecified factors. Experiments on six benchmark datasets of different types demonstrate that our method outperforms existing state-of-the-art methods. We further show that the latent representations from DLS-Clustering also maintain the ability to generate diverse and high-quality images, which can support more promising application scenarios.
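For readers unfamiliar with maximum mean discrepancy, the following is a minimal sketch of the standard (biased) squared-MMD estimate between two samples. The RBF kernel and its bandwidth are assumptions made here for illustration; the abstract does not say which kernel or estimator the authors use, and this is not their regularizer.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        rbf_kernel(x, x, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
        - 2.0 * rbf_kernel(x, y, bandwidth).mean()
    )


rng = np.random.default_rng(0)
same_a = rng.normal(size=(200, 2))
same_b = rng.normal(size=(200, 2))           # drawn from the same distribution
shifted = rng.normal(loc=3.0, size=(200, 2))  # drawn from a shifted distribution

print(mmd_squared(same_a, same_b), mmd_squared(same_a, shifted))
```

The estimate is close to zero when the two samples come from the same distribution and grows as they diverge, which is why it can serve as a penalty that pulls encoded samples toward a target distribution.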


Coordination Group Formation for OnLine Coordinated Routing Mechanisms

arXiv.org Machine Learning

This study considers that the collective route choices of travelers en route represent a resolution of their competition on network routes. Understanding this competition well and coordinating travelers' route choices helps mitigate urban traffic congestion. Even though existing studies have developed such mechanisms (e.g., the CRM [1]), we still lack a quantitative method to evaluate the coordination potential and identify proper coordination groups (CGs) to implement the CRM. Thus, these mechanisms face prohibitive computational difficulty when implemented with many opt-in travelers. Motivated by this view, this study develops mathematical approaches to quantify the coordination potential between two and among multiple travelers. Next, we develop the adaptive centroid-based clustering algorithm (ACCA), which splits travelers en route in a local network into CGs, each with proper size and strong coordination potential. Moreover, the ACCA is statistically secured to stop at a locally optimal clustering solution, which balances the inner-cluster and inter-cluster coordination potential. It can be implemented with parallel computation to improve its computing efficiency. Furthermore, we propose a clustering-based coordinated routing mechanism (CB-CRM), which implements a CRM on each individual CG. The numerical experiments built upon both the Sioux Falls and Hardee city networks show that the ACCA works efficiently to form proper coordination groups, so that, compared to the CRM, the CB-CRM significantly improves computation efficiency with minor system performance loss in a large network. This merit becomes more apparent under high penetration and congested traffic conditions. Last, the experiments validate the good features of the ACCA as well as the value of implementing parallel computation.


Detecting Patterns of Physiological Response to Hemodynamic Stress via Unsupervised Deep Learning

arXiv.org Machine Learning

Monitoring physiological responses to hemodynamic stress can help in determining appropriate treatment and ensuring good patient outcomes. Physicians' intuition suggests that the human body has a number of physiological response patterns to hemorrhage which escalate as blood loss continues; however, the exact etiology and phenotypes of such responses are either not well known or understood only at a coarse level. Although previous research has shown that machine learning models can perform well in hemorrhage detection and survival prediction, it is unclear whether machine learning could help to identify and characterize the underlying physiological responses in raw vital sign data. We approach this problem by first transforming the high-dimensional vital sign time series into a tractable, lower-dimensional latent space using a dilated, causal convolutional encoder model trained in a purely unsupervised manner. Second, we identify informative clusters in the embeddings. By analyzing the clusters of latent embeddings and visualizing them over time, we hypothesize that the clusters correspond to the physiological response patterns that match physicians' intuition. Furthermore, we attempt to evaluate the latent embeddings using a variety of methods, such as predicting the cluster labels using explainable features.


Text Mining using Nonnegative Matrix Factorization and Latent Semantic Analysis

arXiv.org Machine Learning

Text clustering is arguably one of the most important topics in modern data mining. However, text data require tokenization, which usually yields a very large and highly sparse term-document matrix that is difficult to process using conventional machine learning algorithms. Methods such as Latent Semantic Analysis have helped mitigate this issue, but they are not completely stable in practice. As a result, we propose a new feature agglomeration method based on Nonnegative Matrix Factorization (NMF). NMF is employed to separate the terms into groups, and then each group's term vectors are agglomerated into a new feature vector. Together, these feature vectors create a new feature space much more suitable for clustering. In addition, we propose a new deterministic initialization for spherical K-Means, which proves very useful for this specific type of data. In order to evaluate the proposed method, we compare it to some of the latest research done in this field, as well as some of the most widely practiced methods. In our experiments, we conclude that the proposed method either significantly improves clustering performance or maintains the performance of other methods while improving the stability of results.
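The grouping-then-agglomeration idea can be sketched with scikit-learn. This is a rough illustration under stated assumptions, not the authors' method: terms are assigned to the NMF component where their loading is largest, and a group's term columns are agglomerated by summation. Both rules, the toy documents, and the helper name `agglomerate_terms` are hypothetical choices made here.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def agglomerate_terms(doc_term, n_groups, seed=0):
    """Group terms with NMF, then sum each group's columns into one feature."""
    model = NMF(n_components=n_groups, init="nndsvd", random_state=seed, max_iter=500)
    model.fit(doc_term)
    # components_ has shape (n_groups, n_terms); pick the strongest group per term.
    term_group = model.components_.argmax(axis=0)
    features = np.column_stack(
        [doc_term[:, term_group == g].sum(axis=1) for g in range(n_groups)]
    )
    return features, term_group


# Toy corpus (an assumption for illustration only).
docs = [
    "cats purr and cats sleep",
    "dogs bark and dogs run",
    "cats and dogs sleep",
    "stocks rise as markets rally",
    "markets fall and stocks drop",
    "investors watch stocks and markets",
]
doc_term = TfidfVectorizer().fit_transform(docs).toarray()  # documents x terms
features, term_group = agglomerate_terms(doc_term, n_groups=2)
print(doc_term.shape, "->", features.shape)
```

The reduced matrix has one column per term group instead of one per term, so it is dense and low-dimensional and can be passed directly to a clustering algorithm such as K-Means.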


Subspace Clustering with Active Learning

arXiv.org Machine Learning

Nicos G. Pavlidis, Department of Management Science, Lancaster University, Lancaster, UK (n.pavlidis@lancaster.ac.uk). Abstract: Subspace clustering is a growing field of unsupervised learning that has gained much popularity in the computer vision community. Applications can be found in areas such as motion segmentation and face clustering. It assumes that data originate from a union of subspaces, and clusters the data depending on the corresponding subspace. In practice, it is reasonable to assume that a limited amount of labels can be obtained, potentially at a cost. Therefore, algorithms that can effectively and efficiently incorporate this information to improve the clustering model are desirable. In this paper, we propose an active learning framework for subspace clustering that sequentially queries informative points and updates the subspace model. The query stage of the proposed framework relies on results from the perturbation theory of principal component analysis, to identify influential and potentially misclassified points. A constrained subspace clustering algorithm is proposed that monotonically decreases the objective function subject to the constraints imposed by the labelled data. We show that our proposed framework is suitable for subspace clustering algorithms including iterative methods and spectral methods. Experiments on synthetic data sets, motion segmentation data sets, and Yale Faces data sets demonstrate the advantage of our proposed active strategy over state-of-the-art. Index Terms: high dimensionality; active learning; subspace clustering; constrained clustering


Towards automatic extractive text summarization of A-133 Single Audit reports with machine learning

arXiv.org Machine Learning

The rapid growth of text data has motivated the development of machine-learning based automatic text summarization strategies that concisely capture the essential ideas in a larger text. This study aimed to devise an extractive summarization method for A-133 Single Audits, which assess if recipients of federal grants are compliant with program requirements for use of federal funding. Currently, these voluminous audits must be manually analyzed by officials for oversight, risk management, and prioritization purposes. Automated summarization has the potential to streamline these processes. Analysis focused on the "Findings" section of ~20,000 Single Audits spanning 2016-2018. Following text preprocessing and GloVe embedding, sentence-level k-means clustering was performed to partition sentences by topic and to establish the importance of each sentence. For each audit, key summary sentences were extracted by proximity to cluster centroids. Summaries were judged by non-expert human evaluation and compared to human-generated summaries using the ROUGE metric. Though the goal was to fully automate summarization of A-133 audits, human input was required at various stages due to large variability in audit writing style, content, and context. Examples of human inputs include the number of clusters, the choice to keep or discard certain clusters based on their content relevance, and the definition of a top sentence. Overall, this approach made progress towards automated extractive summaries of A-133 audits, with future work to focus on full automation and improving summary consistency. This work highlights the inherent difficulty and subjective nature of automated summarization in a real-world application.
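The extraction step described above, clustering sentence vectors with k-means and keeping the sentence closest to each centroid, can be sketched as follows. The sketch assumes sentence embeddings are already available as rows of a NumPy array; random vectors stand in for the GloVe-based embeddings, and the helper name `pick_summary_sentences` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def pick_summary_sentences(sentence_vectors, n_clusters, seed=0):
    """Return, for each k-means cluster, the index of the sentence nearest its centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(sentence_vectors)
    picked = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(
            sentence_vectors[members] - km.cluster_centers_[c], axis=1
        )
        picked.append(int(members[np.argmin(dists)]))
    # Sort so the summary keeps the sentences in document order.
    return sorted(picked)


# Stand-in embeddings (an assumption): two topics of six sentences each.
rng = np.random.default_rng(0)
vectors = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(6, 4)),
    rng.normal(loc=5.0, scale=0.1, size=(6, 4)),
])
summary_indices = pick_summary_sentences(vectors, n_clusters=2)
print(summary_indices)
```

The number of clusters is passed in by hand here, which mirrors the study's observation that the cluster count was one of the human inputs that could not be fully automated.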


Convex Hierarchical Clustering for Graph-Structured Data

arXiv.org Machine Learning

Convex clustering is a recent stable alternative to hierarchical clustering. It formulates the recovery of progressively coalescing clusters as a regularized convex problem. While convex clustering was originally designed for handling Euclidean distances between data points, in a growing number of applications, the data is directly characterized by a similarity matrix or weighted graph. In this paper, we extend the robust hierarchical clustering approach to these broader classes of similarities. Having defined an appropriate convex objective, the crux of this adaptation lies in our ability to provide: (a) an efficient recovery of the regularization path and (b) an empirical demonstration of the use of our method. We address the first challenge through a proximal dual algorithm, for which we characterize both the theoretical efficiency as well as the empirical performance on a set of experiments. Finally, we highlight the potential of our method by showing its application to several real-life datasets, thus providing a natural extension to the current scope of applications of convex clustering.