
 Clustering


The Classification of Optical Galaxy Morphology Using Unsupervised Learning Techniques

arXiv.org Artificial Intelligence

In recent years, large-scale, data-intensive astronomical surveys have produced more detailed images than scientists can manually classify. Even attempts to crowd-source this work will soon be outpaced by the large amount of data generated by modern surveys. This has brought into question the viability of human-based methods for classifying galaxy morphology. While supervised learning methods require datasets with existing labels, unsupervised learning techniques do not. Therefore, this paper implements unsupervised learning techniques to classify the Galaxy Zoo DECaLS dataset. A convolutional autoencoder feature extractor was trained and implemented. The resulting features were then clustered via k-means, fuzzy c-means and agglomerative clustering. These clusters were compared against the true volunteer classifications provided by the Galaxy Zoo DECaLS project. The best results, in general, were produced by the agglomerative clustering method. However, the increase in performance compared to k-means clustering was not significant considering the increase in clustering time. After undergoing the appropriate clustering algorithm optimizations, this approach could prove useful for classifying the better-performing questions and could serve as the basis for a novel approach to generating more "human-like" galaxy morphology classifications from unsupervised techniques.


Affinity-VAE for disentanglement, clustering and classification of objects in multidimensional image data

arXiv.org Artificial Intelligence

In this work we present affinity-VAE: a framework for automatic clustering and classification of objects in multidimensional image data based on their similarity. The method expands on the concept of $\beta$-VAEs with an informed similarity-based loss component driven by an affinity matrix. The affinity-VAE is able to create rotationally-invariant, morphologically homogeneous clusters in the latent representation, with improved cluster separation compared with a standard $\beta$-VAE. We explore the extent of latent disentanglement and continuity of the latent spaces on both 2D and 3D image data, including simulated biological electron cryo-tomography (cryo-ET) volumes as an example of a scientific application.


Color Quantization -- Using K Means Clustering

#artificialintelligence

In simple terms, color quantization reduces the number of distinct colors used in an image. A color space characterizes the color channels whose values give each pixel in the photo its precise hue. Quantization is a useful image compression technique, particularly for devices that can display only a limited number of colors because of memory restrictions. Each pixel can be represented by three features: its R, G and B values. Given that each channel value ranges from 0 to 255, an image can contain up to 256 * 256 * 256 = 16,777,216 distinct colors. Our goal now is to reduce the number of colors to a manageable number.
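The idea can be illustrated with a minimal sketch in plain Python. It assumes the image is already available as a list of (R, G, B) tuples; the function name `quantize_colors` and its parameters (`k`, `iterations`, `seed`) are hypothetical choices for this illustration, not taken from the article. Each pixel is replaced by the nearest of k palette colors found by k-means.

```python
import random


def quantize_colors(pixels, k, iterations=10, seed=0):
    """Reduce a list of (R, G, B) tuples to at most k representative colors.

    Assumes the image contains at least k distinct colors.
    Returns (palette, quantized_pixels).
    """
    rng = random.Random(seed)
    # Initialise the palette with k distinct colors taken from the image.
    palette = rng.sample(sorted(set(pixels)), k)

    def nearest(color):
        # Index of the palette entry with the smallest squared RGB distance.
        return min(
            range(len(palette)),
            key=lambda i: sum((c - p) ** 2 for c, p in zip(color, palette[i])),
        )

    for _ in range(iterations):
        # Assignment step: group every pixel with its closest palette color.
        groups = [[] for _ in palette]
        for px in pixels:
            groups[nearest(px)].append(px)
        # Update step: move each palette color to the mean of its group.
        # An empty group keeps its previous color.
        palette = [
            tuple(round(sum(channel) / len(group)) for channel in zip(*group))
            if group else old
            for group, old in zip(groups, palette)
        ]

    # Replace each pixel by its nearest palette color.
    return palette, [palette[nearest(px)] for px in pixels]
```

Calling `quantize_colors(pixels, k=16)` would return a 16-color palette together with the recolored pixel list. Like any k-means run, the result depends on the random initialisation, so a different `seed` may give a different palette.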


Quantum Sparse Coding

arXiv.org Machine Learning

A ubiquitous problem in machine learning, statistics, and signal processing is to accurately estimate an unknown sparse vector from a few noisy linear measurements. This estimation problem, which we refer to as sparse coding, is at the heart of the field of compressed sensing, revealing that under sparsity assumptions it is possible to successfully recover a signal that is sampled significantly below the Nyquist rate [1, 2]. This, in turn, led to a dramatic increase in magnetic resonance imaging (MRI) scanning session speed [3]. Another exciting application that also builds on the sparsity assumption is unsupervised representation learning, i.e., given high-dimensional input data, such as an image, finding a low-dimensional representation that captures the intrinsic underlying structure in the input [4, 5, 6]. These representations are often used in image restoration tasks to effectively remove noise (denoising) [7, 8], fill in missing pixels (inpainting) [9, 10, 11], and achieve high-quality digital zoom (super-resolution) [10, 12, 13, 14]. Sparsity also plays a key role in linear regression when given a large pool of features, to form a predictive rule that estimates an unknown response using a smaller, interpretable subset of features that manifests the strongest effects [15, 16, 17, 18]. To formalize the sparse coding problem, which is central for tackling the aforementioned applications, we consider the following linear model: $b = Ax + v$, where $A$ is a matrix of size $M \times N$, the vector $x$ is of length $N$, and $v$ is a noise vector of length $M$. In this paper, we focus on a challenging setting in which $M \ll N$, where a crucial assumption we make is that the vector $x$ is $k$-sparse, i.e., it contains only $k$ non-zero elements with $k \ll N$ [2, 1, 19].
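For orientation, the linear model b = Ax + v can be attacked classically with the iterative soft-thresholding algorithm (ISTA), which minimizes 0.5 * ||b - Ax||^2 + lam * ||x||_1. The sketch below is a standard classical baseline, not the quantum method of the paper; the function names, the penalty weight `lam` and the iteration count are assumptions made for this illustration, and A is assumed to be a non-zero matrix given as a list of rows.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal operator of the L1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_objective(A, b, x, lam):
    """Evaluate 0.5 * ||b - Ax||^2 + lam * ||x||_1."""
    residual = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(xj) for xj in x)


def ista(A, b, lam=0.1, iterations=200):
    """Estimate a sparse x from b = Ax + v by iterative soft thresholding."""
    n = len(A[0])
    # Step size 1/L, where L is the squared Frobenius norm of A. This upper
    # bounds the largest eigenvalue of A^T A, so the step is safe (if small).
    lipschitz = sum(a * a for row in A for a in row)
    step = 1.0 / lipschitz
    x = [0.0] * n
    for _ in range(iterations):
        residual = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        # A^T residual is the negative gradient of the least-squares term.
        correlation = [
            sum(A[i][j] * residual[i] for i in range(len(A))) for j in range(n)
        ]
        # Gradient step followed by shrinkage, which zeroes small entries.
        x = [soft_threshold(xj + step * cj, step * lam) for xj, cj in zip(x, correlation)]
    return x
```

The shrinkage step is what produces sparsity: entries whose magnitude stays below `step * lam` are set exactly to zero. A larger `lam` gives a sparser estimate at the cost of a worse fit to b.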


Grouping-matrix based Graph Pooling with Adaptive Number of Clusters

arXiv.org Artificial Intelligence

Graph pooling is a crucial operation for encoding hierarchical structures within graphs. Most existing graph pooling approaches formulate the problem as a node clustering task which effectively captures the graph topology. Conventional methods ask users to specify an appropriate number of clusters as a hyperparameter, then assume that all input graphs share the same number of clusters. In inductive settings where the number of clusters can vary, however, the model should be able to represent this variation in its pooling layers in order to learn suitable clusters. Thus we propose GMPool, a novel differentiable graph pooling architecture that automatically determines the appropriate number of clusters based on the input data. The main intuition involves a grouping matrix defined as a quadratic form of the pooling operator, which induces use of binary classification probabilities of pairwise combinations of nodes. GMPool obtains the pooling operator by first computing the grouping matrix, then decomposing it. Extensive evaluations on molecular property prediction tasks demonstrate that our method outperforms conventional methods.
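The role of the grouping matrix can be pictured with a small sketch. It assumes a hard assignment matrix S (one row per node, one column per cluster) and forms M = S S^T, so that M[i][j] scores whether nodes i and j belong together. The greedy read-out in `clusters_from_grouping` is a simplification introduced here for illustration; the abstract only states that GMPool decomposes the grouping matrix, and all function names and the `threshold` parameter are hypothetical.

```python
def grouping_matrix(S):
    """Return M = S S^T for an assignment matrix S (N nodes x K clusters)."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in S]
        for row_i in S
    ]


def clusters_from_grouping(M, threshold=0.5):
    """Read cluster labels off a grouping matrix without fixing K in advance.

    Illustrative greedy procedure: each unlabeled node opens a new cluster and
    pulls in every unlabeled node whose pairwise score exceeds the threshold.
    Returns (labels, number_of_clusters).
    """
    labels = [None] * len(M)
    next_label = 0
    for i in range(len(M)):
        if labels[i] is None:
            labels[i] = next_label
            for j in range(i + 1, len(M)):
                if labels[j] is None and M[i][j] > threshold:
                    labels[j] = next_label
            next_label += 1
    return labels, next_label
```

Because the number of clusters is read off M rather than passed in, two graphs with different grouping matrices can yield different cluster counts, which is the behavior the abstract describes for inductive settings.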


Change Detection for Local Explainability in Evolving Data Streams

arXiv.org Artificial Intelligence

As complex machine learning models are increasingly used in sensitive applications like banking, trading or credit scoring, there is a growing demand for reliable explanation mechanisms. Local feature attribution methods have become a popular technique for post-hoc and model-agnostic explanations. However, attribution methods typically assume a stationary environment in which the predictive model has been trained and remains stable. As a result, it is often unclear how local attributions behave in realistic, constantly evolving settings such as streaming and online applications. In this paper, we discuss the impact of temporal change on local feature attributions. In particular, we show that local attributions can become obsolete each time the predictive model is updated or concept drift alters the data generating distribution. Consequently, local feature attributions in data streams provide high explanatory power only when combined with a mechanism that allows us to detect and respond to local changes over time. To this end, we present CDLEEDS, a flexible and model-agnostic framework for detecting local change and concept drift. CDLEEDS serves as an intuitive extension of attribution-based explanation techniques to identify outdated local attributions and enable more targeted recalculations. In experiments, we also show that the proposed framework can reliably detect both local and global concept drift. Accordingly, our work contributes to a more meaningful and robust explainability in online machine learning.


Semi-Supervised Clustering via Dynamic Graph Structure Learning

arXiv.org Artificial Intelligence

Most existing semi-supervised graph-based clustering methods exploit the supervisory information by either refining the affinity matrix or directly constraining the low-dimensional representations of data points. The affinity matrix represents the graph structure and is vital to the performance of semi-supervised graph-based clustering. However, existing methods adopt a static affinity matrix to learn the low-dimensional representations of data points and do not optimize the affinity matrix during the learning process. In this paper, we propose a novel dynamic graph structure learning method for semi-supervised clustering. In this method, we simultaneously optimize the affinity matrix and the low-dimensional representations of data points by leveraging the given pairwise constraints. Moreover, we propose an alternating minimization approach with proven convergence to solve the proposed nonconvex model. During the iteration process, our method cyclically updates the low-dimensional representations of data points and refines the affinity matrix, leading to a dynamic affinity matrix (graph structure). Specifically, for the update of the affinity matrix, we enforce the data points with remarkably different low-dimensional representations to have an affinity value of 0. Furthermore, we construct the initial affinity matrix by integrating the local distance and global self-representation among data points. Experimental results on eight benchmark datasets under different settings show the advantages of the proposed approach.


Understanding Self-Directed Learning in an Online Laboratory

arXiv.org Artificial Intelligence

We describe a study on the use of an online laboratory for self-directed learning by constructing and simulating conceptual models of ecological systems. In this study, we could observe only the modeling behaviors and outcomes; the learning goals and outcomes were unknown. We used machine learning techniques to analyze the modeling behaviors of 315 learners and 822 conceptual models they generated. We derive three main conclusions from the results. First, learners manifest three types of modeling behaviors: observation (simulation focused), construction (construction focused), and full exploration (model construction, evaluation and revision). Second, while observation was the most common behavior among all learners, construction without evaluation was more common for less engaged learners and full exploration occurred mostly for more engaged learners. Third, learners who explored the full cycle of model construction, evaluation and revision generated models of higher quality. These modeling behaviors provide insights into self-directed learning at large.


Merged-GHCIDR: Geometrical Approach to Reduce Image Data

arXiv.org Artificial Intelligence

The computational resources required to train a model have been increasing since the inception of deep networks. Training neural networks on massive datasets has become a challenging and time-consuming task. So, there arises a need to reduce the dataset without compromising the accuracy. In this paper, we present novel variations of an earlier approach called reduction through homogeneous clustering for reducing dataset size. The proposed methods are based on the idea of partitioning the dataset into homogeneous clusters and selecting images that contribute significantly to the accuracy. We propose two variations, Geometrical Homogeneous Clustering for Image Data Reduction (GHCIDR) and Merged-GHCIDR, upon the baseline algorithm, Reduction through Homogeneous Clustering (RHC), to achieve better accuracy and training time. The intuition behind GHCIDR involves selecting data points by cluster weights and geometrical distribution of the training set. Merged-GHCIDR involves merging clusters having the same labels using complete linkage clustering. We used three deep learning models: Fully Connected Networks (FCN), VGG1, and VGG16. We experimented with the two variants on four datasets: MNIST, CIFAR10, Fashion-MNIST, and Tiny-Imagenet. Merged-GHCIDR with the same percentage reduction as RHC showed an increase of 2.8%, 8.9%, 7.6% and 3.5% accuracy on MNIST, Fashion-MNIST, CIFAR10, and Tiny-Imagenet, respectively.


Advancing Reacting Flow Simulations with Data-Driven Models

arXiv.org Artificial Intelligence

The use of machine learning algorithms to predict behaviors of complex systems is booming. However, the key to an effective use of machine learning tools in multi-physics problems, including combustion, is to couple them to physical and computer models. The performance of these tools is enhanced if all the prior knowledge and the physical constraints are embodied. In other words, the scientific method must be adapted to bring machine learning into the picture, and make the best use of the massive amount of data we have produced, thanks to the advances in numerical computing. The present chapter reviews some of the open opportunities for the application of data-driven reduced-order modeling of combustion systems. Examples of feature extraction in turbulent combustion data, empirical low-dimensional manifold (ELDM) identification, classification, regression, and reduced-order modeling are provided.