 dimension reduction technique


Dimensionally Reduced Open-World Clustering: DROWCULA

arXiv.org Artificial Intelligence

Working with annotated data is the cornerstone of supervised learning. Nevertheless, providing labels to instances is a task that requires significant human effort. Several critical real-world applications make things more complicated because, no matter how many labels have been identified in a task of interest, examples corresponding to novel classes may appear in the future. Not surprisingly, prior work in this so-called 'open-world' context has focused largely on semi-supervised approaches. Focusing on image classification, somewhat paradoxically, we propose a fully unsupervised approach to the problem of determining the novel categories in a particular dataset. Our approach relies on estimating the number of clusters using Vision Transformers, which utilize attention mechanisms to generate vector embeddings. Furthermore, we incorporate manifold learning techniques to refine these embeddings by exploiting the intrinsic geometry of the data, thereby enhancing the overall image clustering performance. Overall, we establish new state-of-the-art results on single-modal clustering and Novel Class Discovery on CIFAR-10, CIFAR-100, ImageNet-100, and Tiny ImageNet. We do so both when the number of clusters is known ahead of time and when it is unknown.


EmbedOR: Provable Cluster-Preserving Visualizations with Curvature-Based Stochastic Neighbor Embeddings

arXiv.org Artificial Intelligence

Stochastic Neighbor Embedding (SNE) algorithms like UMAP and tSNE often produce visualizations that do not preserve the geometry of noisy and high-dimensional data. In particular, they can spuriously separate connected components of the underlying data submanifold and can fail to find clusters in well-clusterable data. To address these limitations, we propose EmbedOR, an SNE algorithm that incorporates discrete graph curvature. Our algorithm stochastically embeds the data using a curvature-enhanced distance metric that emphasizes underlying cluster structure. Critically, we prove that the EmbedOR distance metric extends consistency results for tSNE to a much broader class of datasets. We also describe extensive experiments on synthetic and real data that demonstrate the visualization and geometry-preservation capabilities of EmbedOR. We find that, unlike other SNE algorithms and UMAP, EmbedOR is much less likely to fragment continuous, high-density regions of the data. Finally, we demonstrate that the EmbedOR distance metric can be used as a tool to annotate existing visualizations to identify fragmentation and provide deeper insight into the underlying geometry of the data.


Forward-Cooperation-Backward (FCB) learning in a Multi-Encoding Uni-Decoding neural network architecture

arXiv.org Artificial Intelligence

The most popular technique for training a neural network is backpropagation. Recently, the Forward-Forward technique has also been introduced for certain learning tasks. However, in real life, human learning does not follow either of these techniques exclusively. Human learning is essentially a combination of forward learning, backward propagation and cooperation. Humans start learning a new concept by themselves and try to refine their understanding hierarchically, during which they might come across several doubts. The most common way to resolve doubts is discussion with peers, which can be called cooperation. Cooperation/discussion/knowledge sharing among peers is one of the most important steps of learning that humans follow. However, a few doubts might remain even after the discussion. The difference between the understanding of the concept and the original literature is then identified and minimized over several revisions. Inspired by this, the paper introduces Forward-Cooperation-Backward (FCB) learning in a deep neural network framework, mimicking the human way of learning a new concept. A novel deep neural network architecture, called the Multi Encoding Uni Decoding neural network model, has been designed, which learns using the notion of FCB. A special lateral synaptic connection has also been introduced to realize cooperation. The models have been evaluated in terms of their performance in dimension reduction on four popular datasets. The ability to preserve the granular properties of data in low-rank embedding has been tested to assess the quality of dimension reduction. For downstream analyses, classification has also been performed. An experimental study on convergence analysis has been performed to establish the efficacy of the FCB learning strategy.


Scalable Methods for Nonnegative Matrix Factorizations of Near-separable Tall-and-skinny Matrices

Neural Information Processing Systems

Numerous algorithms are used for nonnegative matrix factorization under the assumption that the matrix is nearly separable. In this paper, we show how to make these algorithms scalable for data matrices that have many more rows than columns, so-called "tall-and-skinny matrices." One key component to these improved methods is an orthogonal matrix transformation that preserves the separability of the NMF problem. Our final methods need to read the data matrix only once and are suitable for streaming, multi-core, and MapReduce architectures. We demonstrate the efficacy of these algorithms on terabyte-sized matrices from scientific computing and bioinformatics.


Input Guided Multiple Deconstruction Single Reconstruction neural network models for Matrix Factorization

arXiv.org Artificial Intelligence

Referring back to the original text in the course of hierarchical learning is a common human trait that ensures the right direction of learning. The models developed in this paper, based on the concept of Non-negative Matrix Factorization (NMF), are inspired by this idea. They aim to deal with high-dimensional data by discovering its low-rank approximation through a unique pair of factor matrices. The model named Input Guided Multiple Deconstruction Single Reconstruction neural network for Non-negative Matrix Factorization (IG-MDSR-NMF) ensures the non-negativity constraints of both factors, whereas Input Guided Multiple Deconstruction Single Reconstruction neural network for Relaxed Non-negative Matrix Factorization (IG-MDSR-RNMF) introduces a novel idea of factorization with only the basis matrix adhering to the non-negativity criteria. This relaxed version helps the model learn a more enriched low-dimensional embedding of the original data matrix. The ability of both models to preserve the local structure of data in their low-rank embeddings has been verified. The superiority of the low-dimensional embedding over the original data, which justifies the need for dimension reduction, has been established. The primacy of both models has also been validated by comparing their performances separately with that of nine other established dimension reduction algorithms on five popular datasets. Moreover, the computational complexity of the models and a convergence analysis have been presented, testifying to the supremacy of the models.


Non-linear dimension reduction in factor-augmented vector autoregressions

arXiv.org Machine Learning

The COVID-19 pandemic is among the most severe health, economic and social crises in recent decades and poses the greatest challenge to the world economy since World War II. The virus has spread around the globe and paralyzed entire economic sectors and activities. For economic modeling, the COVID-19 pandemic entails dealing with huge, unprecedented outliers in datasets, which adversely affect the reliability of established, mostly linear, economic models. To the detriment of those commonly used models, economic indicators and variables are prone to unanticipated movements and do not respond in the way they are supposed to. Large shifts in the level of certain variables and strong deviations from their usual paths clearly aggravate the challenge of handling large outliers within existing econometric models.


RMFGP: Rotated Multi-fidelity Gaussian process with Dimension Reduction for High-dimensional Uncertainty Quantification

arXiv.org Machine Learning

Multi-fidelity modelling arises in many situations in computational science and engineering. It enables accurate inference even when only a small set of accurate data is available. Those data often come from a high-fidelity model, which is computationally expensive. By combining the realizations of the high-fidelity model with one or more low-fidelity models, the multi-fidelity method can make accurate predictions of quantities of interest. This paper proposes a new dimension reduction framework based on rotated multi-fidelity Gaussian process regression and a Bayesian active learning scheme for cases where the available precise observations are insufficient. By drawing samples from the trained rotated multi-fidelity model, the so-called supervised dimension reduction problems can be solved following the idea of the sliced average variance estimation (SAVE) method combined with a Gaussian process regression dimension reduction technique. The general framework we develop can effectively solve high-dimensional problems when the data are insufficient for applying traditional dimension reduction methods. Moreover, a more accurate surrogate Gaussian process model of the original problem can be obtained based on our trained model. The effectiveness of the proposed rotated multi-fidelity Gaussian process (RMFGP) model is demonstrated in four numerical examples. The results show that our method has better performance in all cases, and uncertainty propagation analysis is performed for the last two cases, which involve stochastic partial differential equations.


Applying PCA to Stocks

#artificialintelligence

This blog post is a summary of a data science project I worked on a few months ago. This was another attempt at trying to understand hidden trends in the stock market. Hopefully the results will show you how "non-linear", complex, and unpredictable the market can be. Introduction: A stock's time series can be thought of as some realization of an underlying trend with added stochasticity or "noise". So surely, for an appropriate window of time, one can group bunches of stocks that are moving with an underlying trend.
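
To make the "underlying trend plus noise" idea concrete, here is a minimal sketch and not the blog author's code. It assumes synthetic daily returns in which every stock shares one common trend factor plus independent noise, and it finds the first principal component by power iteration on the sample covariance matrix. The helper names `simulate_returns` and `first_principal_component`, and all numeric settings, are hypothetical choices made for illustration.

```python
import math
import random


def simulate_returns(n_days=250, n_stocks=5, seed=0):
    """Synthetic data (an assumption): return = shared trend + per-stock noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_days):
        trend = rng.gauss(0.0, 0.02)  # common market move for the day
        rows.append([trend + rng.gauss(0.0, 0.005) for _ in range(n_stocks)])
    return rows


def first_principal_component(rows, iters=200):
    """Return (unit loading vector, share of total variance it explains)."""
    n, d = len(rows), len(rows[0])
    # Center each column (stock) on its mean return.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the stocks.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the leading eigenvalue (v has unit length).
    eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    total_var = sum(cov[a][a] for a in range(d))
    return v, eigval / total_var


if __name__ == "__main__":
    loadings, share = first_principal_component(simulate_returns())
    print("loadings:", [round(x, 3) for x in loadings])
    print("variance share of first component:", round(share, 3))
```

In this synthetic setup the loadings on the first component all share a sign, which is one way to read "stocks moving with an underlying trend". Real market data would be far noisier than this toy example suggests.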


CS 229 - Unsupervised Learning Cheatsheet

#artificialintelligence

Motivation: The goal of unsupervised learning is to find hidden patterns in unlabeled data $\{x^{(1)},...,x^{(m)}\}$. Jensen's inequality: Let $f$ be a convex function and $X$ a random variable. Latent variables: Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted $z$. We note $c^{(i)}$ the cluster of data point $i$ and $\mu_j$ the center of cluster $j$. Algorithm: After randomly initializing the cluster centroids $\mu_1,\mu_2,...,\mu_k\in\mathbb{R}^n$, the $k$-means algorithm repeats the following step until convergence: Algorithm: It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.
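
The excerpt cuts off before stating the step that $k$-means repeats. The sketch below assumes the standard Lloyd iteration, which is not quoted from the cheatsheet: assign each point to its nearest centroid $\mu_j$, then move each centroid to the mean of the points assigned to it. The function name `kmeans`, its parameters, and the toy data are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100, seed=0):
    """Return (centroids, labels) after running Lloyd's iteration."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    labels = []
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each point.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centroids, labels = kmeans(pts, k=2)
    print("centroids:", centroids)
    print("labels:", labels)
```

Because the result depends on the random initialization, practical implementations usually run several restarts and keep the solution with the lowest total within-cluster distance.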