tSNE


EmbedOR: Provable Cluster-Preserving Visualizations with Curvature-Based Stochastic Neighbor Embeddings

Saidi, Tristan Luca, Hickok, Abigail, Rieck, Bastian, Blumberg, Andrew J.

arXiv.org Artificial Intelligence

Stochastic Neighbor Embedding (SNE) algorithms like UMAP and tSNE often produce visualizations that do not preserve the geometry of noisy, high-dimensional data. In particular, they can spuriously separate connected components of the underlying data submanifold and can fail to find clusters in well-clusterable data. To address these limitations, we propose EmbedOR, an SNE algorithm that incorporates discrete graph curvature. Our algorithm stochastically embeds the data using a curvature-enhanced distance metric that emphasizes underlying cluster structure. Critically, we prove that the EmbedOR distance metric extends consistency results for tSNE to a much broader class of datasets. We also describe extensive experiments on synthetic and real data that demonstrate the visualization and geometry-preservation capabilities of EmbedOR. We find that, unlike other SNE algorithms and UMAP, EmbedOR is much less likely to fragment continuous, high-density regions of the data. Finally, we demonstrate that the EmbedOR distance metric can be used as a tool to annotate existing visualizations to identify fragmentation and provide deeper insight into the underlying geometry of the data.
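The abstract does not spell out how the curvature-enhanced metric is computed, but the plumbing it relies on is standard: tSNE can be driven by any precomputed distance matrix. The sketch below is not the authors' code; the name curvature_enhanced_distances is a placeholder, and plain Euclidean distances stand in for EmbedOR's metric to show where it would slot in.

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import pairwise_distances

X, _ = make_blobs(n_samples=300, centers=3, n_features=50, random_state=0)

# Stand-in for the curvature-enhanced metric: plain Euclidean distances.
# EmbedOR would instead supply a matrix that stretches distances across
# low-density regions, emphasizing the underlying cluster structure.
curvature_enhanced_distances = pairwise_distances(X, metric="euclidean")

embedding = TSNE(
    n_components=2,
    metric="precomputed",  # accept an arbitrary distance matrix
    init="random",         # required by scikit-learn when metric="precomputed"
    perplexity=30,
    random_state=0,
).fit_transform(curvature_enhanced_distances)
print(embedding.shape)     # (300, 2)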



DiRe-JAX: A JAX based Dimensionality Reduction Algorithm for Large-scale Data

Kolpakov, Alexander, Rivin, Igor

arXiv.org Artificial Intelligence

Summary: DiRe-JAX is a new dimensionality reduction toolkit designed to address some of the challenges faced by traditional methods like UMAP and tSNE, such as loss of global structure and limited computational efficiency. Built on the JAX framework, DiRe leverages modern hardware acceleration to provide an efficient, scalable, and interpretable solution for visualizing complex data structures and for quantitative analysis of lower-dimensional embeddings. The toolkit shows considerable promise in preserving both local and global structure within the data compared to state-of-the-art UMAP and tSNE implementations. This makes it suitable for a wide range of applications in machine learning, bioinformatics, and data science. Traditional dimensionality reduction techniques such as UMAP and tSNE are widely used for visualizing high-dimensional data in lower-dimensional spaces, usually 2D and sometimes 3D.
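The summary does not give DiRe's actual update rule, so the snippet below is only a generic SNE-style embedding loop written in JAX; it illustrates the kind of jit-compiled, hardware-accelerated computation such a toolkit builds on, not the DiRe algorithm itself.

import jax
import jax.numpy as jnp

def pairwise_sq_dists(Z):
    # Squared Euclidean distances between all rows of Z.
    s = jnp.sum(Z ** 2, axis=1)
    return s[:, None] + s[None, :] - 2.0 * Z @ Z.T

def kl_loss(Y, P):
    # Student-t low-dimensional affinities (as in tSNE), then KL(P || Q).
    num = 1.0 / (1.0 + pairwise_sq_dists(Y))
    num = num * (1.0 - jnp.eye(Y.shape[0]))  # zero the diagonal
    Q = num / jnp.sum(num)
    return jnp.sum(P * jnp.log((P + 1e-12) / (Q + 1e-12)))

@jax.jit
def step(Y, P, lr=1.0):
    # One gradient-descent step on the embedding coordinates.
    return Y - lr * jax.grad(kl_loss)(Y, P)

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (200, 20))        # toy high-dimensional data
D = pairwise_sq_dists(X)
P = jax.nn.softmax(-D / 10.0, axis=1)        # crude Gaussian affinities (no perplexity calibration)
P = (P + P.T) / (2.0 * X.shape[0])           # symmetrize into a joint distribution
Y = 1e-2 * jax.random.normal(key, (200, 2))  # initial 2-D embedding
for _ in range(100):
    Y = step(Y, P)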


Reviews: No Pressure! Addressing the Problem of Local Minima in Manifold Learning Algorithms

Neural Information Processing Systems

Dimensionality reduction techniques such as tSNE are widely used to visualize and interpret (and often over-interpret) high-dimensional data. Such visualization has become a staple in the field, and it has been a while since I have seen substantial progress in improving these techniques; this paper is such a case. Reviewer 1 summarizes the contribution and its importance better than I could word it myself: this work has two main contributions, which are sufficiently significant given the interest in visualization and dimensionality reduction via SNE, tSNE, and further extensions: 1. Identification of pressure points that are "stuck" in suboptimal locations in the embedding due to local minima caused by dimensionality constraints. The manuscript is well written, well motivated, and convincingly establishes the reasoning behind the proposed approach as well as its effectiveness. All three reviewers agree on accepting the paper. All reviewers agreed that the paper provides new insights, a novel approach, and a valuable practical contribution that is extensively validated on multiple datasets and is well written.


Large data limits and scaling laws for tSNE

Murray, Ryan, Pickarski, Adam

arXiv.org Machine Learning

This work considers large-data asymptotics for t-distributed stochastic neighbor embedding (tSNE), a widely-used non-linear dimension reduction algorithm. We identify an appropriate continuum limit of the tSNE objective function, which can be viewed as a combination of a kernel-based repulsion and an asymptotically-vanishing Laplacian-type regularizer. As a consequence, we show that embeddings of the original tSNE algorithm cannot have any consistent limit as $n \to \infty$. We propose a rescaled model which mitigates the asymptotic decay of the attractive energy, and which does have a consistent limit.
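For reference, the finite-sample objective whose large-n behavior the paper studies is the standard tSNE loss, written below in LaTeX notation; the rescaled model and the continuum functional derived in the paper are not reproduced here.

p_{j \mid i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}, \qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}, \qquad
q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}},

C(y_1, \dots, y_n) = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.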


Unsupervised Learning via Network-Aware Embeddings

Damstrup, Anne Sophie Riis, Madsen, Sofie Tosti, Coscia, Michele

arXiv.org Artificial Intelligence

Data clustering, the task of grouping observations according to their similarity, is a key component of unsupervised learning, with real-world applications in diverse fields such as biology, medicine, and social science. Often in these fields the data comes with complex interdependencies between the dimensions of analysis; for instance, the various characteristics and opinions people can have live on a complex social network. Current clustering methods are ill-suited to tackle this complexity: deep learning can approximate these dependencies, but cannot take their explicit map as the input of the analysis. In this paper, we aim to fix this blind spot in the unsupervised learning literature. We create network-aware embeddings by estimating the network distance between numeric node attributes via the generalized Euclidean distance. Unlike all methods in the literature that we know of, we do not cluster the nodes of the network, but rather its node attributes. In our experiments we show that having these network embeddings is always beneficial for the learning task; that our method scales to large networks; and that we can provide actionable insights in applications in a variety of fields such as marketing, economics, and political science. Our method is fully open source, and data and code are available to reproduce all results in the paper.

Finding patterns in unlabeled data - a task known as unsupervised learning - is useful when we need to build understanding from data Hastie et al. (2009). Unsupervised learning includes data clustering: grouping observations into clusters according to some criterion represented by a quality or loss function Gan et al. (2020). Applications range from grouping genes with related expression patterns in biology Ranade et al. (2001) and finding patterns in tissue images in medicine Filipovych et al. (2011) to segmenting customers for marketing purposes. Popular data clustering algorithms include DBSCAN Ester et al. (1996), OPTICS Ankerst et al. (1999), k-Means, and more. Modern data clustering approaches rely on deep learning and specifically deep neural networks Aljalbout et al. (2018); Aggarwal et al. (2018); Pang et al. (2021); Ezugwu et al. (2022), or on denoising with autoencoders Nawaz et al. (2022); Cai et al. (2022). However, these approaches work in (deformations of) Euclidean spaces - where dependencies between the dimensions of the analysis can be learned Mahalanobis (1936); Xie et al. (2016) - but the problem to be tackled here is fundamentally non-Euclidean Bronstein et al. (2017). Graph Neural Networks (GNNs) Scarselli et al. (2008); Wu et al. (2022); Zhou et al. (2020a) work in non-Euclidean settings, and they are the focus of this paper.
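The abstract does not define the generalized Euclidean distance explicitly; in related work by the same group it is a Laplacian-pseudoinverse quadratic form, which is the assumption the toy snippet below makes (the exact variant used in this paper may differ).

import networkx as nx
import numpy as np

def network_distance(G, a, b):
    # Assumed form: sqrt((a - b)^T L^+ (a - b)), with L^+ the Moore-Penrose
    # pseudoinverse of the graph Laplacian of G. a and b are numeric
    # node-attribute vectors indexed consistently with G's node order.
    L = nx.laplacian_matrix(G).toarray().astype(float)
    L_pinv = np.linalg.pinv(L)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(diff @ L_pinv @ diff))

G = nx.karate_club_graph()
rng = np.random.default_rng(0)
a = rng.random(G.number_of_nodes())   # e.g. one attribute/opinion value per node
b = rng.random(G.number_of_nodes())
print(network_distance(G, a, b))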


Capturing the Flow of Art History

Ji, Chenxi

arXiv.org Artificial Intelligence

Do we really understand how machines classify art styles? Historically, art has been perceived and interpreted by human eyes, and there are always controversial discussions over how people identify and understand art. Historians and the general public tend to interpret the subject matter of art through the context of history and social factors. Style, however, is different from subject matter. Style does not correspond to the presence of certain objects in a painting; it is mainly related to form and can be correlated with features at different levels (Ahmed Elgammal et al. 2018). This makes identifying and classifying the characteristics of an artwork's style, and tracing its "transition" - how it flows and evolves - a challenge for both humans and machines. In this project, a series of state-of-the-art neural networks and manifold learning algorithms are explored to unveil this intriguing topic: how does a machine capture and interpret the flow of Art History?


Apple Leaf Disease Detection

#artificialintelligence

Foliar diseases are caused by bacteria, fungi, and viruses. These diseases can attack leaves and cause spots, complete death, and defoliation, affecting the plant's health. The data consist of four types of images: healthy leaves, apple rust, which is caused by a fungus called Gymnosporangium juniperi-virginianae, apple scab, which is caused by the ascomycete fungus Venturia inaequalis, and leaves that contain two or more diseases. Nowadays the yield of crops is often not up to the mark in terms of quality and quantity, due to many reasons - soil quality, pollution, fertilizers, etc. - which results in loss of income and field quality. Farmers are often not aware of these diseases, their causes, and their solutions.


Cluster Weighted Model Based on TSNE algorithm for High-Dimensional Data

Olobatuyi, Kehinde

arXiv.org Artificial Intelligence

As with many machine learning models, both the accuracy and speed of cluster weighted models (CWMs) can be hampered by high-dimensional data, which has motivated previous work on parsimonious techniques to reduce the effect of the "curse of dimensionality" on mixture models. In this work, we review the background of cluster weighted models (CWMs). We further show that parsimonious techniques are not sufficient for mixture models to thrive in the presence of huge high-dimensional data. We discuss a heuristic for detecting the hidden components by choosing the initial values of the location parameters using the default values in the "FlexCWM" R package. We introduce a dimensionality reduction technique called t-distributed stochastic neighbor embedding (TSNE) to enhance the parsimonious CWMs in high-dimensional space. Originally, CWMs are suited for regression, but for classification purposes all multi-class variables are transformed logarithmically with some noise. The parameters of the model are obtained via the expectation-maximization algorithm. The effectiveness of the discussed technique is demonstrated using real data sets from different fields.
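A rough sketch of the pipeline described above, with scikit-learn's Gaussian mixture (fit by EM) standing in for the full cluster weighted model; the paper fits the actual CWM with the FlexCWM R package, which is not reproduced here.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

X, y = load_digits(return_X_y=True)            # 64-dimensional toy data

# Step 1: TSNE as the dimensionality reduction front end.
Z = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Step 2: fit a mixture model on the embedded data via expectation-maximization.
gm = GaussianMixture(n_components=10, covariance_type="full", random_state=0)
labels = gm.fit_predict(Z)
print(labels[:20])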


Why you should be using PHATE for dimensionality reduction

#artificialintelligence

As data scientists, we often work with high-dimensional data that has more than 3 features, or dimensions, of interest. In supervised machine learning, we may use this data for training and classification, for example, and may reduce the dimensionality to speed up training. In unsupervised learning, we use this type of data for visualization and clustering. In single-cell RNA sequencing (scRNA-seq), for example, we accumulate measurements of tens of thousands of genes per cell for upwards of a million cells. That's a lot of data, and it provides a window into each cell's identity, state, and other properties.
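A minimal usage sketch, assuming the open-source phate Python package (installable via pip install phate) and its scikit-learn-style API; the data matrix here is random noise standing in for a cells-by-genes expression matrix.

import numpy as np
import phate

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 500))   # stand-in for a cells x genes matrix

op = phate.PHATE(n_components=2)      # scikit-learn-style estimator
embedding = op.fit_transform(data)    # 2-D coordinates, one row per cell
print(embedding.shape)                # (1000, 2)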