Sarkar, Rik
Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms
Andreeva, Rayna, Dupuis, Benjamin, Sarkar, Rik, Birdal, Tolga, Şimşekli, Umut
We present a novel set of rigorous and computationally efficient topology-based complexity notions that exhibit a strong correlation with the generalization gap in modern deep neural networks (DNNs). DNNs show remarkable generalization properties, yet the source of these capabilities remains elusive, defying established statistical learning theory. Recent studies have revealed that properties of training trajectories can be indicative of generalization. Building on this insight, state-of-the-art methods have leveraged the topology of these trajectories, particularly their fractal dimension, to quantify generalization. Most existing works compute this quantity by assuming continuous- or infinite-time training dynamics, complicating the development of practical estimators capable of accurately predicting generalization without access to test data. In this paper, we respect the discrete-time nature of training trajectories and investigate the underlying topological quantities that are amenable to topological data analysis tools. This leads to a new family of reliable topological complexity measures that provably bound the generalization error, eliminating the need for restrictive geometric assumptions. These measures are computationally friendly, enabling us to propose simple yet effective algorithms for computing generalization indices. Moreover, our flexible framework can be extended to different domains, tasks, and architectures. Our experimental results demonstrate that our new complexity measures correlate highly with the generalization error in industry-standard architectures such as transformers and deep graph networks. Our approach consistently outperforms existing topological bounds across a wide range of datasets, models, and optimizers, highlighting the practical relevance and effectiveness of our complexity measures.
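The measures in question are built from persistent homology of the discrete set of training iterates. As a rough illustration of the flavour of such quantities (not the paper's exact estimators), the sketch below computes an alpha-weighted sum of degree-0 persistence lifetimes of a trajectory, using the fact that these lifetimes coincide with the edge lengths of a minimum spanning tree; the function name, the random stand-in trajectory, and the choice alpha = 1 are all illustrative.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    def trajectory_complexity(iterates, alpha=1.0):
        # iterates: (T, d) array, one flattened parameter vector per step
        dists = squareform(pdist(iterates))   # pairwise distance matrix
        mst = minimum_spanning_tree(dists)    # MST over the trajectory
        lifetimes = mst.data                  # degree-0 persistence lifetimes
        return np.sum(lifetimes ** alpha)     # alpha-weighted lifetime sum

    iterates = np.random.randn(200, 10)       # stand-in for recorded SGD iterates
    print(trajectory_complexity(iterates))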
Metric Space Magnitude for Evaluating Unsupervised Representation Learning
Limbeck, Katharina, Andreeva, Rayna, Sarkar, Rik, Rieck, Bastian
Determining suitable low-dimensional representations of complex high-dimensional data is a challenging task in numerous applications. Whether it is the preprocessing of biological datasets prior to their analysis (Nguyen & Holmes, 2019), the visualisation of complex structure in single-cell sequencing data (Lähnemann et al., 2020), or the comparison of different manifold representations (Barannikov et al., 2022): an understanding of structural (dis)similarities is crucial, especially in the context of datasets that are ever-increasing in size and dimensionality. The primary assumption driving such analyses is the manifold hypothesis, which posits that data is a (noisy) subsample from some unknown manifold. Operating under this assumption, manifold learning methods have made large advances in detecting complex structures in data, but they typically use local measures of embedding quality, which ultimately rely on local approximations of manifolds by k-nearest neighbour graphs. However, such approximations, which require specific parameter choices and thresholds, can have a substantial negative impact on both embedding results and the interpretation of evaluation scores. Moreover, despite the increasing popularity of non-linear dimensionality reduction methods that claim to preserve local and global structure, recent work (Chari & Pachter, 2023) casts doubt on the assumption that 'good' embeddings should also faithfully preserve distances, raising the question of how to measure the inevitable distortions introduced by representation learning. Thus, there is a need for novel methods in representation learning that efficiently summarise data across varying levels of similarity, eliminating the need to rely on fixed neighbourhood graphs. Motivated by these considerations, we adopt a more general perspective that does not rely on manifold approximations. To this end, we propose a novel embedding quality measure based on metric space magnitude, a recently proposed mathematical invariant that encapsulates numerous important geometric characteristics of metric spaces.
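For a finite metric space, magnitude has a closed form: with similarity matrix Z_ij = exp(-t d(x_i, x_j)) at scale t > 0, the magnitude is the total sum of the entries of Z^{-1} (when Z is invertible). A minimal sketch, with the function name and example data chosen purely for illustration:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def magnitude(points, t=1.0):
        D = squareform(pdist(points))            # pairwise distances
        Z = np.exp(-t * D)                       # similarity matrix
        w = np.linalg.solve(Z, np.ones(len(Z)))  # weighting: Z w = 1
        return w.sum()                           # magnitude = sum of weights

    points = np.random.rand(50, 2)
    for t in (0.1, 1.0, 10.0):   # sweeping t traces out the magnitude function
        print(t, magnitude(points, t))

Sweeping t yields the magnitude function, which behaves like an effective number of points at each resolution and underlies the embedding quality measure proposed here.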
Machine Learning and Topological Data Analysis Identify Unique Features of Human Papillae in 3D Scans
Andreeva, Rayna, Sarkar, Anwesha, Sarkar, Rik
The tongue surface houses a range of papillae that are integral to the mechanics and chemistry of taste and textural sensation. Although the gustatory function of papillae is well investigated, the uniqueness of papillae within and across individuals remains elusive. Here, we present the first machine learning framework on 3D microscopic scans of human papillae (n = 2092), uncovering the uniqueness of geometric and topological features of papillae. The finer differences in the shapes of papillae are investigated computationally, based on features derived from discrete differential geometry and computational topology. Interpretable machine learning techniques show that persistent homology features of papillae shape are the most effective in predicting the biological variables. Models trained on these features with small volumes of data samples predict the type of papillae with an accuracy of 85%. The papillae type classification models can map the spatial arrangement of filiform and fungiform papillae on a surface. Remarkably, the papillae are found to be distinctive across individuals, and an individual can be identified with an accuracy of 48% among the 15 participants from a single papilla. Collectively, this is the first evidence demonstrating that tongue papillae can serve as a unique identifier, inspiring new research directions for food preferences and oral diagnostics.
Inference and Interference: The Role of Clipping, Pruning and Loss Landscapes in Differentially Private Stochastic Gradient Descent
Watson, Lauren, Gan, Eric, Dantam, Mohan, Mirzasoleiman, Baharan, Sarkar, Rik
Differentially private stochastic gradient descent (DP-SGD) is known to have poorer training and test performance on large neural networks, compared to ordinary stochastic gradient descent (SGD). In this paper, we perform a detailed study and comparison of the two processes and unveil several new insights. By comparing the behavior of the two processes separately in early and late epochs, we find that while DP-SGD makes slower progress in early stages, it is the behavior in the later stages that determines the end result. A separate analysis of the clipping and noise addition steps of DP-SGD shows that while noise introduces errors to the process, gradient descent can recover from these errors when it is not clipped, and clipping appears to have a larger impact than noise. These effects are amplified in higher dimensions (large neural networks), where the loss basin occupies a lower-dimensional space. We argue theoretically and using extensive experiments that magnitude pruning can be a suitable dimension reduction technique in this regard, and find that heavy pruning can improve the test accuracy of DP-SGD.
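For concreteness, a single DP-SGD update clips each per-example gradient to norm C and adds Gaussian noise of standard deviation sigma * C. A minimal sketch on flattened gradients (names and defaults are illustrative; a production implementation would use a dedicated library such as Opacus):

    import torch

    def dp_sgd_step(params, per_example_grads, lr=0.1, C=1.0, sigma=1.0):
        # per_example_grads: (B, d) tensor, one flattened gradient per example
        B = per_example_grads.shape[0]
        norms = per_example_grads.norm(dim=1, keepdim=True)
        clipped = per_example_grads * torch.clamp(C / norms, max=1.0)  # clip to C
        noise = sigma * C * torch.randn_like(params)                   # Gaussian noise
        grad = (clipped.sum(dim=0) + noise) / B                        # noisy mean
        with torch.no_grad():
            params -= lr * grad
        return params

Setting sigma = 0 isolates the clipping step, which is how the effects of clipping and noise can be compared separately, as in the analysis above.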
Accelerated Shapley Value Approximation for Data Evaluation
Watson, Lauren, Kujawa, Zeno, Andreeva, Rayna, Yang, Hao-Tsung, Elahi, Tariq, Sarkar, Rik
Data valuation has found various applications in machine learning, such as data filtering, efficient learning, and incentives for data sharing. The most popular current approach to data valuation is the Shapley value. While popular for its various applications, the Shapley value is computationally expensive even to approximate, as it requires repeated iterations of training models on different subsets of data. In this paper we show that the Shapley value of data points can be approximated more efficiently by leveraging the structural properties of machine learning problems. We derive convergence guarantees on the accuracy of the approximate Shapley value for different learning settings, including Stochastic Gradient Descent with convex and non-convex loss functions. Our analysis suggests that, in fact, models trained on small subsets are more important in the context of data valuation. Based on this idea, we describe $\delta$-Shapley -- a strategy of using only small subsets for the approximation. Experiments show that this approach preserves the approximate values and ranks of data points, while achieving a speedup of up to 9.9x. For pre-trained networks, the approach is found to bring further efficiency by enabling accurate valuation from small subsets.
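The small-subset idea can be sketched as follows: estimate a point's value from marginal contributions v(S ∪ {i}) - v(S) sampled only over small subsets S, rather than over all cardinalities. In this hedged sketch, `utility` (e.g. the test accuracy of a model trained on the given subset) and `max_size` are placeholders, and the uniform averaging over small cardinalities is a simplification of the properly weighted estimator analysed in the paper:

    import random

    def small_subset_value(i, data_indices, utility, max_size=5, samples=100):
        others = [j for j in data_indices if j != i]
        total = 0.0
        for _ in range(samples):
            size = random.randint(0, max_size - 1)   # only small cardinalities
            S = random.sample(others, size)
            total += utility(S + [i]) - utility(S)   # marginal contribution of i
        return total / samples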
Towards Understanding the Interplay of Generative Artificial Intelligence and the Internet
Martínez, Gonzalo, Watson, Lauren, Reviriego, Pedro, Hernández, José Alberto, Juarez, Marc, Sarkar, Rik
The rapid adoption of generative Artificial Intelligence (AI) tools that can generate realistic images or text, such as DALL-E, MidJourney, or ChatGPT, has put the societal impacts of these technologies at the center of public debate. These tools are possible due to the massive amount of data (text and images) that is publicly available through the Internet. At the same time, these generative AI tools become content creators that are already contributing to the data available to train future models. Therefore, future versions of generative AI tools will be trained on a mix of human-created and AI-generated content, creating a potential feedback loop between generative AI and public data repositories. This interaction raises many questions: how will future versions of generative AI tools behave when trained on a mixture of real and AI-generated data? Will they evolve and improve with the new datasets, or, on the contrary, will they degrade? Will evolution introduce biases or reduce diversity in subsequent generations of generative AI tools? What are the societal implications of the possible degradation of these models? Can we mitigate the effects of this feedback loop? In this document, we explore the effect of this interaction and report some initial results using simple diffusion models trained on various image datasets. Our results show that the quality and diversity of the generated images can degrade over time, suggesting that incorporating AI-created data can have undesired effects on future versions of generative models.
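The feedback loop itself can be expressed as a simple simulation protocol; in this sketch, `train_model`, `generate`, and `quality_score` are placeholders for a concrete diffusion pipeline and metric (e.g. FID against the real data), and the mixing ratio is illustrative:

    def feedback_loop(real_data, train_model, generate, quality_score,
                      generations=5, synthetic_fraction=0.5, n_samples=1000):
        data = list(real_data)
        scores = []
        for _ in range(generations):
            model = train_model(data)                # retrain every generation
            synthetic = generate(model, n_samples)   # model-made content
            scores.append(quality_score(synthetic))  # track quality over time
            keep = int(len(real_data) * (1 - synthetic_fraction))
            data = list(real_data)[:keep] + list(synthetic)  # mixed training set
        return scores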
Metric Space Magnitude and Generalisation in Neural Networks
Andreeva, Rayna, Limbeck, Katharina, Rieck, Bastian, Sarkar, Rik
Deep learning models have seen significant successes in numerous applications, but their inner workings remain elusive. The purpose of this work is to quantify the learning process of deep neural networks through the lens of a novel topological invariant called magnitude. Magnitude is an isometry invariant; its properties are an active area of research as it encodes many known invariants of a metric space. We use magnitude to study the internal representations of neural networks and propose a new method for determining their generalisation capabilities. Moreover, we theoretically connect magnitude dimension and the generalisation error, and demonstrate experimentally that the proposed framework can be a good indicator of the latter.
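The dimension in question is the magnitude dimension: writing tX for the metric space X with all distances scaled by t > 0, and |tX| for its magnitude, the magnitude dimension is the growth rate of the magnitude function (when the limit exists):

    \dim_{\mathrm{Mag}}(X) = \lim_{t \to \infty} \frac{\log |tX|}{\log t}

It is this growth rate that the paper connects theoretically to the generalisation error.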
The Shapley Value in Machine Learning
Rozemberczki, Benedek, Watson, Lauren, Bayer, Péter, Yang, Hao-Tsung, Kiss, Olivér, Nilsson, Sebastian, Sarkar, Rik
Over the last few years, the Shapley value, a solution concept from cooperative game theory, has found numerous applications in machine learning. In this paper, we first discuss fundamental concepts of cooperative game theory and axiomatic properties of the Shapley value. Then we give an overview of the most important applications of the Shapley value in machine learning: feature selection, explainability, multi-agent reinforcement learning, ensemble pruning, and data valuation. We examine the most crucial limitations of the Shapley value and point out directions for future research.
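For reference, in a cooperative game with player set N and value function v, the Shapley value of player i is the weighted average of i's marginal contributions over all coalitions S not containing i:

    \phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (|N| - |S| - 1)!}{|N|!} \bigl( v(S \cup \{i\}) - v(S) \bigr)

In the machine learning applications surveyed here, the players are features, data points, or ensemble members, and v measures model performance on a coalition.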
PyTorch Geometric Temporal: Spatiotemporal Signal Processing with Neural Machine Learning Models
Rozemberczki, Benedek, Scherer, Paul, He, Yixuan, Panagopoulos, George, Astefanoaei, Maria, Kiss, Oliver, Beres, Ferenc, Collignon, Nicolas, Sarkar, Rik
We present PyTorch Geometric Temporal, a deep learning framework combining state-of-the-art machine learning algorithms for neural spatiotemporal signal processing. The main goal of the library is to make temporal geometric deep learning available to researchers and machine learning practitioners in a unified, easy-to-use framework. PyTorch Geometric Temporal was built on existing libraries in the PyTorch ecosystem and provides streamlined neural network layer definitions, temporal snapshot generators for batching, and integrated benchmark datasets. These features are illustrated with a tutorial-like case study. Experiments demonstrate the predictive performance of the models implemented in the library on real-world problems such as epidemiological forecasting, ride-hail demand prediction, and web-traffic management. Our sensitivity analysis of runtime shows that the framework can potentially operate on web-scale datasets with rich temporal features and spatial structure.
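A minimal sketch of the library's central abstraction, an iterable of temporal graph snapshots, assuming a recent version of the package (exact class names and signatures may differ across releases); the toy graph and training loop are illustrative:

    import numpy as np
    import torch
    from torch_geometric_temporal.signal import StaticGraphTemporalSignal, temporal_signal_split
    from torch_geometric_temporal.nn.recurrent import DCRNN

    edge_index = np.array([[0, 1, 2], [1, 2, 0]])         # toy static graph
    edge_weight = np.ones(3)
    features = [np.random.rand(3, 4) for _ in range(20)]  # (nodes x feats) per step
    targets = [np.random.rand(3) for _ in range(20)]      # per-node targets per step

    dataset = StaticGraphTemporalSignal(edge_index, edge_weight, features, targets)
    train, test = temporal_signal_split(dataset, train_ratio=0.8)

    model = DCRNN(in_channels=4, out_channels=16, K=2)    # recurrent graph conv
    head = torch.nn.Linear(16, 1)
    opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=0.01)
    for snapshot in train:                                # one snapshot per time step
        h = model(snapshot.x, snapshot.edge_index, snapshot.edge_attr)
        loss = torch.mean((head(h).squeeze() - snapshot.y) ** 2)
        loss.backward()
        opt.step()
        opt.zero_grad()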
Chickenpox Cases in Hungary: a Benchmark Dataset for Spatiotemporal Signal Processing with Graph Neural Networks
Rozemberczki, Benedek, Scherer, Paul, Kiss, Oliver, Sarkar, Rik, Ferenci, Tamas
Recurrent graph convolutional neural networks are highly effective machine learning techniques for spatiotemporal signal processing. Newly proposed graph neural network architectures are repeatedly evaluated on standard tasks such as traffic or weather forecasting. In this paper, we propose the Chickenpox Cases in Hungary dataset as a new dataset for comparing graph neural network architectures. Our time series analysis and forecasting experiments demonstrate that the Chickenpox Cases in Hungary dataset is adequate for comparing the predictive performance and forecasting capabilities of novel recurrent graph neural network architectures.
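The dataset ships with the PyTorch Geometric Temporal package described above; a hedged loading sketch (loader name and arguments as in recent versions of the library):

    from torch_geometric_temporal.dataset import ChickenpoxDatasetLoader
    from torch_geometric_temporal.signal import temporal_signal_split

    loader = ChickenpoxDatasetLoader()
    dataset = loader.get_dataset(lags=4)   # 4 past weeks as node features
    train, test = temporal_signal_split(dataset, train_ratio=0.9)
    for snapshot in train:                 # one week of cases on the county graph
        print(snapshot.x.shape, snapshot.y.shape)
        break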