Clustering
Efficient Data Analytics on Augmented Similarity Triplets
Ahmad, Muhammad, Shakeel, Muhammad Haroon, Ali, Sarwan, Khan, Imdadullah, Zaman, Arif, Karim, Asim
Many machine learning methods (classification, clustering, etc.) start with a known kernel that provides a similarity or distance measure between two objects. Recent work has extended this to situations where the information about objects is limited to comparisons of distances between three objects (triplets). Humans find the comparison task much easier than the estimation of absolute similarities, so this kind of data can be easily obtained using crowd-sourcing. In this work, we give an efficient method of augmenting the triplet data by utilizing additional implicit information inferred from the existing data. Triplet augmentation improves the quality of kernel-based and kernel-free data analytics tasks. We also propose a novel set of algorithms for common supervised and unsupervised machine learning tasks based on triplets. These methods work directly with triplets, avoiding kernel evaluations. Experimental evaluation on real and synthetic datasets shows that our methods are more accurate than the current best-known techniques.
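The abstract does not say which implicit information is used for augmentation. As one plausible illustration, the sketch below assumes a triplet (a, b, c) means "a is closer to b than to c" and adds every triplet implied by transitivity for a fixed anchor a. The function name `augment_triplets` and the transitivity rule are assumptions made for this example, not the authors' method.

```python
from collections import defaultdict


def augment_triplets(triplets):
    """Return the input triplets plus those implied by transitivity.

    Assumption: (a, b, c) encodes d(a, b) < d(a, c). For a fixed anchor a,
    d(a, b) < d(a, c) and d(a, c) < d(a, e) together imply d(a, b) < d(a, e).
    """
    # farther[a][b] holds the items directly known to be farther from a than b.
    farther = defaultdict(lambda: defaultdict(set))
    for a, b, c in triplets:
        farther[a][b].add(c)

    augmented = set()
    for a, graph in farther.items():
        for start in list(graph):
            # Depth-first search collects everything reachable from `start`,
            # i.e. every item that must be farther from a than `start` is.
            stack = list(graph[start])
            seen = set()
            while stack:
                node = stack.pop()
                if node in seen:
                    continue
                seen.add(node)
                stack.extend(graph.get(node, ()))
            # Skip (a, start, start), which only appears for inconsistent input.
            augmented.update((a, start, far) for far in seen if far != start)
    return augmented


if __name__ == "__main__":
    print(sorted(augment_triplets([(0, 1, 2), (0, 2, 3)])))
```

With the two input triplets in the usage line, the only new triplet is (0, 1, 3). The sketch assumes the input comparisons are mutually consistent; contradictory crowd-sourced answers would need separate handling.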
Parameter Free Clustering with Cluster Catch Digraphs (Technical Report)
Manukyan, Artür, Ceyhan, Elvan
Clustering is one of the most challenging tasks in machine learning and pattern recognition, and discovering the exact number of clusters in an unlabelled data set is perhaps the leading challenge. Many clustering methods find the clusters (or hidden classes) and the number of these clusters simultaneously (Frey and Dueck, 2007; Sajana et al., 2016). Although there exist methods for validating and comparing the quality of a partitioning of a data set, algorithms that provide the (estimated) number of clusters without any input parameter are still appealing. However, such methods or algorithms rely on other parameters viewed as the intensity, i.e. the expected number of objects in a unit area. The value of the intensity parameter works as a threshold: if the local intensity of the data set exceeds the threshold, it may indicate the existence of a possible cluster. The choice of such parameters is often difficult, since different values may drastically change the result of the algorithm. We use unsupervised adaptations of a family of vertex random digraphs, namely class cover catch digraphs (CCCDs), that showed relatively good performance in statistical pattern classification (Manukyan and Ceyhan, 2016; Priebe et al., 2003a). Unsupervised versions of CCCDs are called cluster catch digraphs (CCDs) (DeVinney, 2003; Marchette, 2004). Primarily, CCDs use statistics that require an intensity parameter to be specified or estimated.
Learning with Wasserstein barycenters and applications
Domazakis, G., Drivaliaris, D., Koukoulas, S., Papayiannis, G., Tsekrekos, A., Yannacopoulos, A.
In this work, learning schemes for measure-valued data are proposed, i.e. data whose structure can be more efficiently represented as probability measures instead of points in $\mathbb{R}^d$, employing the concept of probability barycenters as defined with respect to the Wasserstein metric. Such learning approaches are valuable in many fields where the observational/experimental error is significant (e.g. astronomy, biology, remote sensing, etc.) or where the nature of the data is more complex and traditional learning algorithms are not applicable or effective (e.g. network data, interval data, high frequency records, matrix data, etc.). Under this perspective, each observation is identified with an appropriate probability measure, and the proposed statistical learning schemes rely on discrimination criteria that utilize the geometric structure of the space of probability measures through core techniques from optimal transport theory. The discussed approaches are implemented in two real-world applications: (a) clustering eurozone countries according to their observed government bond yield curves and (b) classifying the areas of a satellite image into land-use categories, which is a standard task in remote sensing. In both case studies the results are particularly interesting and meaningful while the accuracy obtained is high.
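To make the barycenter idea concrete, here is a minimal sketch for the simplest special case only: empirical measures on the real line with the same number of equally weighted samples. In that case the 2-Wasserstein distance reduces to comparing sorted samples, and the barycenter is the weighted average of the sorted samples. The function names are illustrative, and the general multivariate setting treated in the paper requires an optimal transport solver that is not shown here.

```python
import math


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equal-size empirical measures on R."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal coupling matches sorted samples in order.
    squared = sum((x - y) ** 2 for x, y in zip(sorted(xs), sorted(ys)))
    return math.sqrt(squared / len(xs))


def barycenter_1d(samples, weights=None):
    """2-Wasserstein barycenter of equal-size empirical measures on R."""
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("this sketch assumes equal sample sizes")
    if weights is None:
        weights = [1.0 / len(samples)] * len(samples)
    sorted_samples = [sorted(s) for s in samples]
    # Averaging the k-th order statistics averages the quantile functions.
    return [
        sum(w * s[k] for w, s in zip(weights, sorted_samples))
        for k in range(n)
    ]


if __name__ == "__main__":
    a, b = [0.0, 2.0], [4.0, 2.0]
    print(barycenter_1d([a, b]), wasserstein2_1d(a, b))
```

A clustering scheme in this spirit could alternate between assigning each measure to its nearest barycenter under `wasserstein2_1d` and recomputing each barycenter with `barycenter_1d`, in analogy with k-means.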
Evolutionary Clustering via Message Passing
Arzeno, Natalia M., Vikalo, Haris
We are often interested in clustering objects that evolve over time and identifying solutions to the clustering problem for every time step. Evolutionary clustering provides insight into cluster evolution and temporal changes in cluster memberships while enabling performance superior to that achieved by independently clustering data collected at different time points. In this paper, we introduce evolutionary affinity propagation (EAP), an evolutionary clustering algorithm that groups data points by exchanging messages on a factor graph. EAP promotes temporal smoothness of the solution to clustering time-evolving data by linking the nodes of the factor graph that are associated with adjacent data snapshots, and introduces consensus nodes to enable cluster tracking and identification of cluster births and deaths. Unlike existing evolutionary clustering methods that require additional processing to approximate the number of clusters or match them across time, EAP determines the number of clusters and tracks them automatically. A comparison with existing methods on simulated and experimental data demonstrates the effectiveness of the proposed EAP algorithm.
Machine Learning Interview Questions And Answers
Machine learning (ML) is a rising field. It offers many interesting and well-paid jobs and opportunities. Each of these and some other items might be touched on in an ML interview. There is a large number of possible questions and topics. This article presents 12 general questions (with brief answers) appropriate mainly for beginners and intermediates.
Self-adaption grey DBSCAN clustering
Clustering analysis, a classical issue in data mining, is widely used in various research areas. This article proposes a self-adaption grey DBSCAN clustering (SAG-DBSCAN) algorithm. First, the grey relational matrix is used to obtain the grey local density indicator. This indicator is then applied to perform self-adapting noise identification, which yields a dense subset of the clustering dataset. Finally, a DBSCAN variant that automatically selects its parameters is used to cluster the dense subset. Several frequently-used datasets were used to demonstrate the performance and effectiveness of the proposed clustering algorithm and to compare the results with those of other state-of-the-art algorithms. The comprehensive comparisons indicate that our method has advantages over the other compared methods.
An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for High Dimensional Data
Singh, Vikas, Verma, Nishchal K.
This paper presents a new fuzzy k-means algorithm for the clustering of high dimensional data in various subspaces. In high dimensional data, some features might be irrelevant, and the relevant ones may have different significance in the clustering. For a better clustering, it is crucial to incorporate the contribution of these features in the clustering process. To this end, we propose a new fuzzy k-means clustering algorithm in which the objective function of the fuzzy k-means is modified using two different entropy terms. The first entropy term helps to minimize the within-cluster dispersion and maximize the negative entropy to determine clusters to contribute to the association of data points. The second entropy term helps to control the weight of the features, because different features have different contributing weights in the clustering process for obtaining a better partition of the data. The efficacy of the proposed method is presented in terms of various clustering measures on multiple datasets and compared with various state-of-the-art methods.
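The abstract does not give the update rules, so the sketch below only illustrates the generic effect of an entropy regularizer. Minimizing a cost plus a temperature times the negative entropy of weights that sum to one yields weights proportional to exp(-cost / temperature). This form is applied here to cluster memberships and to feature weights. The names `softmin`, `memberships`, `feature_weights` and the temperatures `gamma` and `eta` are assumptions for illustration and may differ from the paper's formulation.

```python
import math


def softmin(costs, temperature):
    """Weights proportional to exp(-cost / temperature), normalized to sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    shift = min(costs)  # subtracting the minimum avoids overflow in exp
    exps = [math.exp(-(c - shift) / temperature) for c in costs]
    total = sum(exps)
    return [e / total for e in exps]


def memberships(point, centers, weights, gamma):
    """Fuzzy memberships of one point from feature-weighted squared distances."""
    costs = [
        sum(w * (x - c) ** 2 for w, x, c in zip(weights, point, center))
        for center in centers
    ]
    return softmin(costs, gamma)


def feature_weights(dispersions, eta):
    """Feature weights: lower within-cluster dispersion gives a larger weight."""
    return softmin(dispersions, eta)


if __name__ == "__main__":
    w = feature_weights([0.2, 5.0], eta=1.0)
    print(w, memberships([0.1, 3.0], [[0.0, 0.0], [1.0, 0.0]], w, gamma=0.5))
```

Small temperatures push the weights toward a hard assignment, while large temperatures spread them evenly, which is the trade-off an entropy term controls.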
Interpretable Embeddings From Molecular Simulations Using Gaussian Mixture Variational Autoencoders
Varolgunes, Yasemin Bozkurt, Bereau, Tristan, Rudzinski, Joseph F.
Extracting insight from the enormous quantity of data generated from molecular simulations requires the identification of a small number of collective variables whose corresponding low-dimensional free-energy landscape retains the essential features of the underlying system. Data-driven techniques provide a systematic route to constructing this landscape, without the need for extensive a priori intuition into the relevant driving forces. In particular, autoencoders are powerful tools for dimensionality reduction, as they naturally force an information bottleneck and, thereby, a low-dimensional embedding of the essential features. While variational autoencoders ensure continuity of the embedding by assuming a unimodal Gaussian prior, this is at odds with the multi-basin free-energy landscapes that typically arise from the identification of meaningful collective variables. In this work, we incorporate this physical intuition into the prior by employing a Gaussian mixture variational autoencoder (GMVAE), which encourages the separation of metastable states within the embedding. The GMVAE performs dimensionality reduction and clustering within a single unified framework, and is capable of identifying the inherent dimensionality of the input data, in terms of the number of Gaussians required to categorize the data. We illustrate our approach on two toy models, alanine dipeptide, and a challenging disordered peptide ensemble, demonstrating the enhanced clustering effect of the GMVAE prior compared to standard VAEs. The resulting embeddings appear to be promising representations for constructing Markov state models, highlighting the transferability of the dimensionality reduction from static equilibrium properties to dynamics.
Interactive Open-Ended Learning for 3D Object Recognition
The thesis contributes in several important ways to the research area of 3D object category learning and recognition. To cope with the mentioned limitations, we look at human cognition, in particular at the fact that human beings learn to recognize object categories ceaselessly over time. This ability to refine knowledge from the set of accumulated experiences facilitates the adaptation to new environments. Inspired by this capability, we seek to create a cognitive object perception and perceptual learning architecture that can learn 3D object categories in an open-ended fashion. In this context, ``open-ended'' implies that the set of categories to be learned is not known in advance, and the training instances are extracted from actual experiences of a robot, and thus become gradually available, rather than being available from the beginning of the learning process. In particular, this architecture provides perception capabilities that will allow robots to incrementally learn object categories from the set of accumulated experiences and reason about how to perform complex tasks. This framework integrates detection, tracking, teaching, learning, and recognition of objects. An extensive set of systematic experiments, in multiple experimental settings, was carried out to thoroughly evaluate the described learning approaches. Experimental results show that the proposed system is able to interact with human users, learn new object categories over time, and perform complex tasks. The contributions presented in this thesis have been fully implemented, evaluated on different standard object and scene datasets, and empirically evaluated on different robotic platforms.
Balancing the Tradeoff Between Clustering Value and Interpretability
Saisubramanian, Sandhya, Galhotra, Sainyam, Zilberstein, Shlomo
Graph clustering groups entities -- the vertices of a graph -- based on their similarity, typically using a complex distance function over a large number of features. Successful integration of clustering approaches in automated decision-support systems hinges on the interpretability of the resulting clusters. This paper addresses the problem of generating interpretable clusters, given features of interest that signify interpretability to an end-user, by optimizing interpretability in addition to common clustering objectives. We propose a $\beta$-interpretable clustering algorithm that ensures that at least a $\beta$ fraction of the nodes in each cluster share the same feature value. The tunable parameter $\beta$ is user-specified. We also present a more efficient algorithm for scenarios with $\beta\!=\!1$ and analyze the theoretical guarantees of the two algorithms. Finally, we empirically demonstrate the benefits of our approaches in generating interpretable clusters using four real-world datasets. The interpretability of the clusters is complemented by generating simple explanations denoting the feature values of the nodes in the clusters, using frequent pattern mining.
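The $\beta$ condition stated in the abstract can be checked directly for a given clustering. The sketch below only verifies the condition and does not reproduce the paper's clustering algorithm. It assumes a single feature of interest per node, and the names `interpretability_score` and `is_beta_interpretable` are hypothetical.

```python
from collections import Counter


def interpretability_score(clusters, feature_of):
    """Smallest fraction, over clusters, of nodes sharing the majority value."""
    score = 1.0
    for nodes in clusters:
        if not nodes:
            continue  # an empty cluster places no constraint on the score
        counts = Counter(feature_of[node] for node in nodes)
        majority = counts.most_common(1)[0][1]
        score = min(score, majority / len(nodes))
    return score


def is_beta_interpretable(clusters, feature_of, beta):
    """True if every cluster has at least a beta fraction sharing one value."""
    return interpretability_score(clusters, feature_of) >= beta


if __name__ == "__main__":
    feature_of = {"a": "red", "b": "red", "c": "red", "d": "blue",
                  "e": "blue", "f": "blue"}
    clusters = [["a", "b", "c", "d"], ["e", "f"]]
    print(interpretability_score(clusters, feature_of))
```

In the usage example the first cluster has three of four nodes sharing "red", so the clustering is $\beta$-interpretable for any $\beta \le 0.75$ and not for larger values.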