Clustering
Ultrametric Fitting by Gradient Descent
Chierchia, Giovanni, Perret, Benjamin
We study the problem of fitting an ultrametric distance to a dissimilarity graph in the context of hierarchical cluster analysis. Standard hierarchical clustering methods are specified procedurally, rather than in terms of the cost function to be optimized. We aim to overcome this limitation by presenting a general optimization framework for ultrametric fitting. Our approach consists of modeling the latter as a constrained optimization problem over the continuous space of ultrametrics. In doing so, we can leverage the simple, yet effective, idea of replacing the ultrametric constraint with a min-max operation injected directly into the cost function.
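One common reading of such a min-max operation is the minimax-path (subdominant ultrametric) operator: the ultrametric distance between two points is the smallest achievable "largest edge" over all paths joining them. The sketch below is an illustration under that assumption, not the authors' implementation; the function names and the toy matrix are hypothetical, and a gradient-based method would need a differentiable version of this step.

```python
def subdominant_ultrametric(d):
    """Minimax path distance: u[i][j] is the minimum, over all paths from i
    to j, of the largest dissimilarity met along the path."""
    n = len(d)
    u = [row[:] for row in d]  # copy so the input matrix is left untouched
    # Floyd-Warshall variant where (min, +) is replaced by (min, max).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(u[i][k], u[k][j])
                if via_k < u[i][j]:
                    u[i][j] = via_k
    return u


def is_ultrametric(u, tol=1e-12):
    """Check the strong triangle inequality u(i,j) <= max(u(i,k), u(k,j))."""
    n = len(u)
    return all(
        u[i][j] <= max(u[i][k], u[k][j]) + tol
        for i in range(n) for j in range(n) for k in range(n)
    )


# Toy symmetric dissimilarity matrix (hypothetical data).
d = [[0, 1, 4],
     [1, 0, 2],
     [4, 2, 0]]
u = subdominant_ultrametric(d)
```

On this toy input, the direct dissimilarity of 4 between the first and third points is replaced by 2, because the path through the second point never exceeds 2.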
Clustering with Fast, Automated and Reproducible assessment applied to longitudinal neural tracking
Zhu, Hanlin, Li, Xue, Sun, Liuyang, He, Fei, Zhao, Zhengtuo, Luan, Lan, Tran, Ngoc Mai, Xie, Chong
Across many areas, from neural tracking to database entity resolution, manual assessment of clusters by human experts presents a bottleneck in rapid development of scalable and specialized clustering methods. To solve this problem we develop C-FAR, a novel method for Fast, Automated and Reproducible assessment of multiple hierarchical clustering algorithms simultaneously. Our algorithm takes any number of hierarchical clustering trees as input, then strategically queries pairs for human feedback, and outputs an optimal clustering among those nominated by these trees. While it is applicable to large datasets in any domain that utilizes pairwise comparisons for assessment, our flagship application is the cluster aggregation step in spike-sorting, the task of assigning waveforms (spikes) in recordings to neurons. On simulated data of 96 neurons under adverse conditions, including drifting and 25% blackout, our algorithm produces near-perfect tracking relative to the ground truth. Our runtime scales linearly in the number of input trees, making it a competitive computational tool. These results indicate that C-FAR is highly suitable as a model selection and assessment tool in clustering tasks.
Finding parking spots with Custom Vision and IoT
Machine learning problems can generally be divided into three types: classification and regression, which are known as supervised learning, and unsupervised learning, which in the context of machine learning applications often refers to clustering. In the following article, I am going to give a brief introduction to each of these three problems and will include a walkthrough in the popular Python library scikit-learn.
An Automatic Attribute Based Access Control Policy Extraction from Access Logs
Karimi, Leila, Aldairi, Maryam, Joshi, James, Abdelhakim, Mai
With the rapid advances in computing and information technologies, traditional access control models have become inadequate in terms of capturing fine-grained and expressive security requirements of newly emerging applications. An attribute-based access control (ABAC) model provides a more flexible approach for addressing the authorization needs of complex and dynamic systems. While organizations are interested in employing newer authorization models, migrating to such models poses a significant challenge. Many large-scale businesses need to grant authorization to their user populations that are potentially distributed across disparate and heterogeneous computing environments. Each of these computing environments may have its own access control model. The manual development of a single policy framework for an entire organization is tedious, costly, and error-prone. In this paper, we present a methodology for automatically learning ABAC policy rules from access logs of a system to simplify the policy development process. The proposed approach employs an unsupervised learning-based algorithm for detecting patterns in access logs and extracting ABAC authorization rules from these patterns. In addition, we present two policy improvement algorithms, rule pruning and policy refinement, to generate a higher quality mined policy. Finally, we implement a prototype of the proposed approach to demonstrate its feasibility.
Directionally Dependent Multi-View Clustering Using Copula Model
Afrin, Kahkashan, Iquebal, Ashif S., Karimi, Mostafa, Souris, Allyson, Lee, Se Yoon, Mallick, Bani K.
In recent biomedical scientific problems, it is a fundamental issue to integratively cluster a set of objects from multiple sources of datasets. Such problems are mostly encountered in genomics, where data are collected from various sources and typically represent distinct yet complementary information. Integrating these data sources for multi-source clustering is challenging due to their complex dependence structure, including directional dependency. Particularly in genomics studies, it is known that there is certain directional dependence between DNA expression, DNA methylation, and RNA expression, widely called The Central Dogma. Most of the existing multi-view clustering methods either assume an independent structure or pair-wise (non-directional) dependency, thereby ignoring the directional relationship. Motivated by this, we propose a copula-based multi-view clustering model where a copula enables the model to accommodate the directional dependence existing in the datasets. We conduct a simulation experiment where the simulated datasets exhibit inherent directional dependence: it turns out that ignoring the directional dependence negatively affects the clustering performance. As a real application, we applied our model to the breast cancer tumor samples collected from The Cancer Genome Atlas (TCGA).
Unsupervised machine learning of quantum phase transitions using diffusion maps
Lidiak, Alexander, Gong, Zhexuan
Experimental quantum simulators have become large and complex enough that discovering new physics from the huge amount of measurement data can be quite challenging, especially when little theoretical understanding of the simulated model is available. Unsupervised machine learning methods are particularly promising in overcoming this challenge. For the specific task of learning quantum phase transitions, unsupervised machine learning methods have primarily been developed for phase transitions characterized by simple order parameters, typically linear in the measured observables. However, such methods often fail for more complicated phase transitions, such as those involving incommensurate phases, valence-bond solids, topological order, and many-body localization. We show that the diffusion map method, which performs nonlinear dimensionality reduction and spectral clustering of the measurement data, has significant potential for learning such complex phase transitions unsupervised. This method works for measurements of local observables in a single basis and is thus readily applicable to many experimental quantum simulators as a versatile tool for learning various quantum phases and phase transitions.
Characterising hot stellar systems with confidence
Chattopadhyay, Souradeep, Maitra, Ranjan
Hot stellar systems (HSS) are a collection of stars bound together by gravitational attraction. These systems hold clues to many mysteries of outer space, so understanding their origin, evolution and physical properties is important but remains a huge challenge. We used multivariate $t$-mixtures model-based clustering to analyze 13456 hot stellar systems from Misgeld & Hilker (2011) that included 12763 candidate globular clusters and found eight homogeneous groups using the Bayesian Information Criterion (BIC). A nonparametric bootstrap procedure was used to estimate the confidence of each of our clustering assignments. The eight obtained groups can be characterized in terms of the correlation, mass, effective radius and surface density. Using conventional correlation-mass-effective radius-surface density notation, the largest group, Group 1, can be described as having positive-low-low-moderate characteristics. The other groups, numbered in decreasing sizes, are similarly characterised, with Group 2 having positive-low-low-high characteristics, Group 3 displaying positive-low-low-moderate characteristics, Group 4 having positive-low-low-high characteristics, Group 5 displaying positive-low-moderate-moderate characteristics and Group 6 showing positive-moderate-low-high characteristics. The smallest group (Group 8) shows negative-low-moderate-moderate characteristics. Group 7 has no candidate clusters and so cannot be similarly labeled, but the mass, effective radius correlation for these non-candidates indicates that they are larger than typical globular clusters. Assertions drawn for each group are ambiguous for a few HSS having low confidence in classification. Our analysis identifies distinct kinds of HSS with varying confidence and provides novel insight into their physical and evolutionary properties.
Towards automated kernel selection in machine learning systems: A SYCL case study
Automated tuning of compute kernels is a popular area of research, mainly focused on finding optimal kernel parameters for a problem with fixed input sizes. This approach is good for deploying machine learning models, where the network topology is constant, but machine learning research often involves changing network topologies and hyperparameters. Traditional kernel auto-tuning has limited impact in this case; a more general selection of kernels is required for libraries to accelerate machine learning research. In this paper we present initial results using machine learning to select kernels in a case study deploying high performance SYCL kernels in libraries that target a range of heterogeneous devices from desktop GPUs to embedded accelerators. The techniques investigated apply more generally and could similarly be integrated with other heterogeneous programming systems. By combining auto-tuning and machine learning these kernel selection processes can be deployed with little developer effort to achieve high performance on new hardware.
Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0
Hodari, Zack, Lai, Catherine, King, Simon
In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but no analysis has been performed on what these representations capture, or if they can generate meaningfully-distinct variants of an utterance. We present a phrase-level variational autoencoder with a multi-modal prior, using the mode centres as "intonation codes". Our evaluation establishes which intonation codes are perceptually distinct, finding that the intonation codes from our multi-modal latent model were significantly more distinct than a baseline using k-means clustering. We carry out a follow-up qualitative study to determine what information the codes are carrying. Most commonly, listeners commented on the intonation codes having a statement or question style. However, many other affect-related styles were also reported, including: emotional, uncertain, surprised, sarcastic, passive aggressive, and upset.
DHOG: Deep Hierarchical Object Grouping
Darlow, Luke Nicholas, Storkey, Amos
Recently, a number of competitive methods have tackled unsupervised representation learning by maximising the mutual information between the representations produced from augmentations. The resulting representations are then invariant to stochastic augmentation strategies, and can be used for downstream tasks such as clustering or classification. Yet data augmentations preserve many properties of an image and so there is potential for a suboptimal choice of representation that relies on matching easy-to-find features in the data. We demonstrate that greedy or local methods of maximising mutual information (such as stochastic gradient optimisation) discover local optima of the mutual information criterion; the resulting representations are also less-ideally suited to complex downstream tasks. Earlier work has not specifically identified or addressed this issue. We introduce deep hierarchical object grouping (DHOG) that computes a number of distinct discrete representations of images in a hierarchical order, eventually generating representations that better optimise the mutual information objective. We also find that these representations align better with the downstream task of grouping into underlying object classes. We tested DHOG on unsupervised clustering, which is a natural downstream test as the target representation is a discrete labelling of the data. We achieved new state-of-the-art results on the three main benchmarks without any prefiltering or Sobel-edge detection that proved necessary for many previous methods to work. We obtain accuracy improvements of: 4.3% on CIFAR-10, 1.5% on CIFAR-100-20, and 7.2% on SVHN.
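To make the mutual information criterion concrete, the sketch below computes the mutual information between two discrete labellings of the same images, for example the cluster indices assigned to two augmentations of each image. It is a simplified illustration with hard assignments and hypothetical function names, not the DHOG objective itself, which would be computed on soft network outputs.

```python
import math


def joint_from_assignments(z1, z2, k):
    """Empirical joint distribution of two labellings with k classes each."""
    n = len(z1)
    joint = [[0.0] * k for _ in range(k)]
    for a, b in zip(z1, z2):
        joint[a][b] += 1.0 / n
    return joint


def mutual_information(joint):
    """Mutual information (in nats) of a joint probability table joint[a][b]."""
    row = [sum(r) for r in joint]          # marginal of the first labelling
    col = [sum(c) for c in zip(*joint)]    # marginal of the second labelling
    mi = 0.0
    for a, r in enumerate(joint):
        for b, p in enumerate(r):
            if p > 0:                      # 0 * log(0) is taken as 0
                mi += p * math.log(p / (row[a] * col[b]))
    return mi


# Hypothetical labels for four images under two augmentations.
view_a = [0, 0, 1, 1]
agree = mutual_information(joint_from_assignments(view_a, [0, 0, 1, 1], k=2))
unrelated = mutual_information(joint_from_assignments(view_a, [0, 1, 0, 1], k=2))
```

When the two labellings agree, the value equals the entropy of the labels (log 2 for two balanced classes); when they are statistically independent, it is zero.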