mapper graph


When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity

Rair, Nisrine, Goupil, Alban, Vrabie, Valeriu, Chochoy, Emmanuel

arXiv.org Artificial Intelligence

Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity, especially when human annotators disagree. We propose a topological perspective to analyze how fine-tuned models encode ambiguity and, more generally, individual instances. Applied to RoBERTa-Large on the MD-Offense dataset, Mapper, a tool from topological data analysis, reveals that fine-tuning restructures embedding space into modular, non-convex regions aligned with model predictions, even for highly ambiguous cases. Over $98\%$ of connected components exhibit $\geq 90\%$ prediction purity, yet alignment with ground-truth labels drops in ambiguous data, surfacing a hidden tension between structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper captures this geometry directly, uncovering decision regions, boundary collapses, and overconfident clusters. Our findings position Mapper as a powerful diagnostic tool for understanding how models resolve ambiguity. Beyond visualization, it also enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.
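
The purity figure quoted in this abstract can be illustrated with a small sketch. The abstract does not give the paper's exact definition, so the code assumes that a component's purity is the share of its samples carrying the majority predicted label. The function names (`connected_components`, `component_purities`, `fraction_pure`) and the input layout (one set of sample indices per Mapper node) are hypothetical.

```python
from collections import Counter

def connected_components(n_nodes, edges):
    """Group node ids 0..n_nodes-1 into connected components (union-find)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())

def component_purities(node_members, edges, predicted):
    """Assumed definition: share of the majority predicted label per component.

    node_members[i] is the (non-empty) set of sample indices in Mapper node i;
    predicted[s] is the model's predicted label for sample s.
    """
    purities = []
    for comp in connected_components(len(node_members), edges):
        samples = set().union(*(node_members[n] for n in comp))
        counts = Counter(predicted[s] for s in samples)
        purities.append(max(counts.values()) / len(samples))
    return purities

def fraction_pure(purities, threshold=0.9):
    """Fraction of components whose purity reaches the threshold."""
    return sum(p >= threshold for p in purities) / len(purities)
```

Replacing `predicted` with ground-truth labels in the same computation gives the label-alignment counterpart, which is one way to look at the gap between structural confidence and label uncertainty that the abstract describes.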


Explainable Mapper: Charting LLM Embedding Spaces Using Perturbation-Based Explanation and Verification Agents

Yan, Xinyuan, Sevastjanova, Rita, van der Ben, Sinie, El-Assady, Mennatallah, Wang, Bei

arXiv.org Artificial Intelligence

Large language models (LLMs) produce high-dimensional embeddings that capture rich semantic and syntactic relationships between words, sentences, and concepts. Investigating the topological structures of LLM embedding spaces via mapper graphs enables us to understand their underlying structures. Specifically, a mapper graph summarizes the topological structure of the embedding space, where each node represents a topological neighborhood (containing a cluster of embeddings), and an edge connects two nodes if their corresponding neighborhoods overlap. However, manually exploring these embedding spaces to uncover encoded linguistic properties requires considerable human effort. To address this challenge, we introduce a framework for semi-automatic annotation of these embedding properties. To organize the exploration process, we first define a taxonomy of explorable elements within a mapper graph such as nodes, edges, paths, components, and trajectories. The annotation of these elements is executed through two types of customizable LLM-based agents that employ perturbation techniques for scalable and automated analysis. These agents help to explore and explain the characteristics of mapper elements and verify the robustness of the generated explanations. We instantiate the framework within a visual analytics workspace and demonstrate its effectiveness through case studies. In particular, we replicate findings from prior research on BERT's embedding properties across various layers of its architecture and provide further observations into the linguistic properties of topological neighborhoods.
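
The node-and-edge description in this abstract can be made concrete with a minimal sketch of the basic Mapper construction. This is not the authors' implementation: it assumes a one-dimensional filter, a uniform interval cover with fractional overlap, and clustering by connected components of an eps-neighbourhood graph. All names and default parameters are illustrative.

```python
from itertools import combinations
import math

def interval_cover(lo, hi, n_intervals, overlap):
    """Return n_intervals equal-length intervals covering [lo, hi];
    consecutive intervals share a fraction `overlap` of their length."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]

def eps_clusters(indices, points, eps):
    """Cluster the given point indices into connected components of the
    eps-neighbourhood graph (single linkage cut at height eps)."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, stack = {seed}, [seed]
        while stack:
            i = stack.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) <= eps}
            remaining -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters

def mapper_graph(points, filter_values, n_intervals=4, overlap=0.3, eps=0.5):
    """Nodes are clusters inside each cover element's preimage;
    an edge joins two nodes that share at least one data point."""
    lo, hi = min(filter_values), max(filter_values)
    tol = 1e-9  # guard against rounding at the interval endpoints
    nodes = []
    for a, b in interval_cover(lo, hi, n_intervals, overlap):
        members = [i for i, f in enumerate(filter_values)
                   if a - tol <= f <= b + tol]
        nodes.extend(eps_clusters(members, points, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges
```

For embeddings, `points` would hold the embedding vectors and `filter_values` a scalar summary of each one, for example a norm or a projection. Which filter is appropriate depends on the application and is not specified here.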


A Mapper Algorithm with implicit intervals and its optimization

Tao, Yuyang, Ge, Shufei

arXiv.org Machine Learning

The Mapper algorithm is an essential tool for visualizing complex, high-dimensional data in topological data analysis (TDA) and has been widely used in biomedical research. It outputs a combinatorial graph whose structure implies the shape of the data. However, the need for manual parameter tuning and fixed intervals, along with fixed overlapping ratios, may impede the performance of the standard Mapper algorithm. Variants of the standard Mapper algorithm have been developed to address these limitations, yet most of them still require manual tuning of parameters. Additionally, many of these variants, including the standard version found in the literature, were built within a deterministic framework and overlooked the uncertainty inherent in the data. To relax these limitations, in this work, we introduce a novel framework that implicitly represents intervals through a hidden assignment matrix, enabling automatic parameter optimization via stochastic gradient descent. In this work, we develop a soft Mapper framework based on a Gaussian mixture model (GMM) for flexible and implicit interval construction. We further illustrate the robustness of the soft Mapper algorithm by introducing the Mapper graph mode as a point estimation for the output graph. Moreover, a stochastic gradient descent algorithm with a specific topological loss function is proposed for optimizing parameters in the model. Both simulation and application studies demonstrate its effectiveness in capturing the underlying topological structures. In addition, the application to an RNA expression dataset obtained from the Mount Sinai/JJ Peters VA Medical Center Brain Bank (MSBB) successfully identifies a distinct subgroup of Alzheimer's Disease.


Improving Mapper's Robustness by Varying Resolution According to Lens-Space Density

Ruscitti, Kaleb D., McInnes, Leland

arXiv.org Machine Learning

We propose an improvement to the Mapper algorithm that removes the assumption of a single resolution scale across semantic space, and improves the robustness of the results under change of parameters. This eases parameter selection, especially for datasets with highly variable local density in the Morse function $f$ used for Mapper. This is achieved by incorporating this density into the choice of cover for Mapper. Furthermore, we prove that for covers with some natural hypotheses, the graph output by Mapper still converges in bottleneck distance to the Reeb graph of the Rips complex of the data, but captures more topological features than when using the usual Mapper cover. Finally, we discuss implementation details, and include the results of computational experiments. We also provide an accompanying reference implementation.


Bi-Filtration and Stability of TDA Mapper for Point Cloud Data

Bungula, Wako, Darcy, Isabel

arXiv.org Machine Learning

Carlsson, Singh and Memoli's TDA mapper takes a point cloud dataset and outputs a graph that depends on several parameter choices. Dey, Memoli, and Wang developed Multiscale Mapper for abstract topological spaces so that parameter choices can be analyzed via persistent homology. However, when applied to actual data, one does not always obtain filtrations of mapper graphs. DBSCAN, one of the most common clustering algorithms used in the TDA mapper software, has two parameters, $\epsilon$ and MinPts. If MinPts = 1 then DBSCAN is equivalent to single linkage clustering with cutting height $\epsilon$. We show that if DBSCAN clustering is used with MinPts > 2, a filtration of mapper graphs may not exist except in the absence of free-border points; but such filtrations exist if DBSCAN clustering is used with MinPts = 1 or 2 as the cover size increases, $\epsilon$ increases, and/or MinPts decreases. However, the 1-dimensional filtration is unstable. If one adds noise to a data set so that each data point has been perturbed by a distance at most $\delta$, the persistent homology of the mapper graph of the perturbed data set can be significantly different from that of the original data set. We show that we can obtain stability by increasing both the cover size and $\epsilon$ at the same time. In particular, we show that the bi-filtrations of the homology groups with respect to cover size and $\epsilon$ between these two datasets are $2\delta$-interleaved.
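
The equivalence between DBSCAN with MinPts = 1 and single linkage clustering can be checked on a toy example. The sketch below is a textbook-style DBSCAN written for illustration, not the code used in any TDA mapper package. It assumes the common convention that a point counts toward its own neighbourhood and that two points are neighbours when their distance is at most eps. All function names are hypothetical.

```python
import math

def neighbours(points, i, eps):
    """Indices within distance eps of point i (including i itself)."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(points, i, eps)
        if len(nbrs) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            if labels[j] is not None:
                continue             # already assigned; do not expand again
            labels[j] = cluster
            j_nbrs = neighbours(points, j, eps)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is a core point: keep expanding
    return labels

def single_linkage_cut(points, eps):
    """Labels from cutting the single linkage dendrogram at height eps,
    i.e. connected components of the graph joining points within eps."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

def as_partition(labels):
    """Convert a label list into a set of index sets, ignoring label names."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(idx)
    return {frozenset(g) for g in groups.values()}
```

With `min_pts=1` every point is a core point, so `as_partition(dbscan(points, eps, 1))` should match `as_partition(single_linkage_cut(points, eps))`. Larger values of `min_pts` introduce border and noise points, which are the source of the filtration problems the abstract discusses.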


Differentiable Mapper For Topological Optimization Of Data Representation

Oulhaj, Ziyad, Carrière, Mathieu, Michel, Bertrand

arXiv.org Artificial Intelligence

Unsupervised data representation and visualization using tools from topology is an active and growing field of Topological Data Analysis (TDA) and data science. Its most prominent line of work is based on the so-called Mapper graph, which is a combinatorial graph whose topological structures (connected components, branches, loops) are in correspondence with those of the data itself. While highly generic and applicable, its use has been hampered so far by the manual tuning of its many parameters. Among these, a crucial one is the so-called filter: it is a continuous function whose variations on the data set are the main ingredient for both building the Mapper representation and assessing the presence and sizes of its topological structures. However, while a few parameter tuning methods have already been investigated for the other Mapper parameters (i.e., resolution, gain, clustering), there is currently no method for tuning the filter itself. In this work, we build on a recently proposed optimization framework incorporating topology to provide the first filter optimization scheme for Mapper graphs. In order to achieve this, we propose a relaxed and more general version of the Mapper graph, whose convergence properties are investigated. Finally, we demonstrate the usefulness of our approach by optimizing Mapper graph representations on several datasets, and showcasing the superiority of the optimized representation over arbitrary ones.


A distribution-guided Mapper algorithm

Tao, Yuyang, Ge, Shufei

arXiv.org Artificial Intelligence

Motivation: The Mapper algorithm is an essential tool to explore the shape of data in topological data analysis. With a dataset as an input, the Mapper algorithm outputs a graph representing the topological features of the whole dataset. This graph is often regarded as an approximation of a Reeb graph of the data. The classic Mapper algorithm uses fixed interval lengths and overlapping ratios, which might fail to reveal subtle features of data, especially when the underlying structure is complex. Results: In this work, we introduce a distribution-guided Mapper algorithm named D-Mapper, that utilizes the property of the probability model and data intrinsic characteristics to generate density-guided covers and provides enhanced topological features. Our proposed algorithm is a probabilistic model-based approach, which could serve as an alternative to non-probabilistic ones. Moreover, we introduce a metric accounting for both the quality of overlap clustering and extended persistent homology to measure the performance of Mapper-type algorithms. Our numerical experiments indicate that the D-Mapper outperforms the classical Mapper algorithm in various scenarios. We also apply the D-Mapper to a SARS-CoV-2 coronavirus RNA sequences dataset to explore the topological structure of different virus variants. The results indicate that the D-Mapper algorithm can reveal both vertical and horizontal evolution processes of the viruses. Availability: Our package is available at https://github.com/ShufeiGe/D-Mapper.


$G$-Mapper: Learning a Cover in the Mapper Construction

Alvarado, Enrique, Belton, Robin, Fischer, Emily, Lee, Kang-Ju, Palande, Sourabh, Percival, Sarah, Purvine, Emilie

arXiv.org Machine Learning

The Mapper algorithm is a visualization technique in topological data analysis (TDA) that outputs a graph reflecting the structure of a given dataset. The Mapper algorithm requires tuning several parameters in order to generate a "nice" Mapper graph. The paper focuses on selecting the cover parameter. We present an algorithm that optimizes the cover of a Mapper graph by splitting a cover repeatedly according to a statistical test for normality. Our algorithm is based on $G$-means clustering, which searches for the optimal number of clusters in $k$-means by iteratively conducting the Anderson-Darling test. Our splitting procedure employs a Gaussian mixture model in order to choose the cover carefully based on the distribution of the given data. Experiments on synthetic and real-world datasets demonstrate that our algorithm generates covers so that the Mapper graphs retain the essence of the datasets.


Experimental Observations of the Topology of Convolutional Neural Network Activations

Purvine, Emilie, Brown, Davis, Jefferson, Brett, Joslyn, Cliff, Praggastis, Brenda, Rathore, Archit, Shapiro, Madelyn, Wang, Bei, Zhou, Youjia

arXiv.org Artificial Intelligence

Topological data analysis (TDA) is a branch of computational mathematics, bridging algebraic topology and data science, that provides compact, noise-robust representations of complex structures. Deep neural networks (DNNs) learn millions of parameters associated with a series of transformations defined by the model architecture, resulting in high-dimensional, difficult-to-interpret internal representations of input data. As DNNs become more ubiquitous across multiple sectors of our society, there is increasing recognition that mathematical methods are needed to aid analysts, researchers, and practitioners in understanding and interpreting how these models' internal representations relate to the final classification. In this paper, we apply cutting edge techniques from TDA with the goal of gaining insight into the interpretability of convolutional neural networks used for image classification. We use two common TDA approaches to explore several methods for modeling hidden-layer activations as high-dimensional point clouds, and provide experimental evidence that these point clouds capture valuable structural information about the model's process. First, we demonstrate that a distance metric based on persistent homology can be used to quantify meaningful differences between layers, and we discuss these distances in the broader context of existing representational similarity metrics for neural network interpretability. Second, we show that a mapper graph can provide semantic insight into how these models organize hierarchical class knowledge at each layer. These observations demonstrate that TDA is a useful tool to help deep learning practitioners unlock the hidden structures of their models.


Topology-Driven Generative Completion of Lacunae in Molecular Data

Zubarev, Dmitry Yu., Ristoski, Petar

arXiv.org Artificial Intelligence

Materials discovery is frequently driven by historical data sets that lack characteristics of the data sets specifically constructed to meet the needs of particular discovery efforts. They carry imprints of the ever-changing historical context of the research and development. Shifting priorities of external funding, pressure for momentous technological breakthroughs, community perception of high-profile topics, and evolution of experimental capabilities render historical data a patchwork of findings with poorly understood internal structure. Statistical learning methods are typically concerned with statistical characteristics of the data. In materials discovery, there is an additional pressure to understand the shape of the data in terms of what is known and what is missing, and to inform laborious and expensive data acquisition associated with material preparation, processing, and characterization. In this contribution, we are investigating the interplay between the shape of the historical data expressed as the structure of lacunae, such as gaps, loops, and voids, and the hypothesis generation that informs subsequent data acquisition. We describe an approach that explicitly identifies lacunae via topological data analysis (TDA) and fills them in using constrained generative modeling. TDA is concerned with capturing the shape of the data, that is, the characteristics that are preserved under continuous deformations. The simplest widely accepted form of TDA is clustering.