Gourgoulias, Kostis
DeepClean: Machine Unlearning on the Cheap by Resetting Privacy Sensitive Weights using the Fisher Diagonal
Shi, Jiaeli, Ghalyan, Najah, Gourgoulias, Kostis, Buford, John, Moran, Sean
Machine learning models trained on sensitive or private data can inadvertently memorize and leak that information. Machine unlearning seeks to retroactively remove such details from model weights to protect privacy. We contribute a lightweight unlearning algorithm that leverages the Fisher Information Matrix (FIM) for selective forgetting. Prior work in this area requires full retraining or large matrix inversions, which are computationally expensive. Our key insight is that the diagonal elements of the FIM, which measure the sensitivity of log-likelihood to changes in weights, contain sufficient information for effective forgetting. Specifically, we compute the FIM diagonal over two subsets -- the data to retain and forget -- for all trainable weights. This diagonal representation approximates the complete FIM while dramatically reducing computation. We then use it to selectively update weights to maximize forgetting of the sensitive subset while minimizing impact on the retained subset. Experiments show that our algorithm can successfully forget any randomly selected subsets of training data across neural network architectures. By leveraging the FIM diagonal, our approach provides an interpretable, lightweight, and efficient solution for machine unlearning with practical privacy benefits.
Estimating Class Separability of Datasets Using Persistent Homology with Application to LLM Fine-Tuning
Ghalyan, Najah, Gourgoulias, Kostis, Satsangi, Yash, Moran, Sean, Labonne, Maxime, Sabelja, Joseph
This paper proposes a method to estimate the class separability of an unlabeled text dataset by inspecting the topological characteristics of sentence-transformer embeddings of the text. Experiments conducted involve both binary and multi-class cases, with balanced and imbalanced scenarios. The results demonstrate a clear correlation and a better consistency between the proposed method and other separability and classification metrics, such as Thornton's method and the AUC score of a logistic regression classifier, as well as unsupervised methods. Finally, we empirically show that the proposed method can be part of a stopping criterion for fine-tuning language-model classifiers. By monitoring the class separability of the embedding space after each training iteration, we can detect when the training process stops improving the separability of the embeddings without using additional labels.
MultiVerse: Causal Reasoning using Importance Sampling in Probabilistic Programming
Perov, Yura, Graham, Logan, Gourgoulias, Kostis, Richens, Jonathan G., Lee, Ciarรกn M., Baker, Adam, Johri, Saurabh
Counterfactuals are particularly special causal questions as they involve the full suite of causal tools: posterior 1 inference and interventional reasoning (Pearl, 2000). Counterfactuals are probabilistic in nature and difficult to infer, but are powerful for explanation (Wachter et al., 2017; Sokol and Flach, 2018; Guidotti et al., 2018; Pedreschi et al., 2019), fairness Kusner et al. (2017); Zhang and Bareinboim (2018); Russell et al. (2017), policy search (e.g. Buesing et al. (2019)) and are also quantities of interest on their own (e.g.
Universal Marginaliser for Deep Amortised Inference for Probabilistic Programs
Walecki, Robert, Gourgoulias, Kostis, Baker, Adam, Hart, Chris, Lucas, Chris, Zwiessele, Max, Buchard, Albert, Lomeli, Maria, Perov, Yura, Johri, Saurabh
Probabilistic programming languages (PPLs) are powerful modelling tools which allow to formalise our knowledge about the world and reason about its inherent uncertainty. Inference methods used in PPL can be computationally costly due to significant time burden and/or storage requirements; or they can lack theoretical guarantees of convergence and accuracy when applied to large scale graphical models. To this end, we present the Universal Marginaliser (UM), a novel method for amortised inference, in PPL. We show how combining samples drawn from the original probabilistic program prior with an appropriate augmentation method allows us to train one neural network to approximate any of the corresponding conditional marginal distributions, with any separation into latent and observed variables, and thus amortise the cost of inference. Finally, we benchmark the method on multiple probabilistic programs, in Pyro, with different model structure.
Universal Marginalizer for Amortised Inference and Embedding of Generative Models
Walecki, Robert, Buchard, Albert, Gourgoulias, Kostis, Hart, Chris, Lomeli, Maria, Navarro, A. K. W., Zwiessele, Max, Perov, Yura, Johri, Saurabh
Probabilistic graphical models are powerful tools which allow us to formalise our knowledge about the world and reason about its inherent uncertainty. There exist a considerable number of methods for performing inference in probabilistic graphical models; however, they can be computationally costly due to significant time burden and/or storage requirements; or they lack theoretical guarantees of convergence and accuracy when applied to large scale graphical models. To this end, we propose the Universal Marginaliser Importance Sampler (UM-IS) -- a hybrid inference scheme that combines the flexibility of a deep neural network trained on samples from the model and inherits the asymptotic guarantees of importance sampling. We show how combining samples drawn from the graphical model with an appropriate masking function allows us to train a single neural network to approximate any of the corresponding conditional marginal distributions, and thus amortise the cost of inference. We also show that the graph embeddings can be applied for tasks such as: clustering, classification and interpretation of relationships between the nodes. Finally, we benchmark the method on a large graph (>1000 nodes), showing that UM-IS outperforms sampling-based methods by a large margin while being computationally efficient.