Kungurtsev, Vyacheslav
Federated Sinkhorn
Kulcsar, Jeremy, Kungurtsev, Vyacheslav, Korpas, Georgios, Giaconi, Giulio, Shoosmith, William
In this work we investigate the potential of solving the discrete Optimal Transport (OT) problem with entropy regularization in a federated learning setting. Recall that the celebrated Sinkhorn algorithm transforms the classical OT linear program into a strongly convex constrained optimization problem, facilitating first order methods for otherwise intractably large problems. A common contemporary setting that remains an open problem as far as the application of Sinkhorn is concerned is the presence of data spread across clients with distributed inter-communication, either because client privacy is a concern or simply out of necessity given processing and memory hardware limitations. In this work we investigate various natural procedures, which we refer to as Federated Sinkhorn, that handle distributed environments where data is partitioned across multiple clients. We formulate the problem as minimizing the transport cost with an entropy regularization term, subject to marginal constraints, where block components of the source and target distribution vectors are locally known to the clients corresponding to each block. We consider both synchronous and asynchronous variants as well as all-to-all and server-client communication topology protocols. Each procedure allows clients to compute local operations on their data partition while periodically exchanging information with others. We provide theoretical guarantees on convergence for the different variants under different possible conditions. We empirically demonstrate the algorithms' performance on synthetic datasets and a real-world financial risk assessment application. The investigation highlights the subtle tradeoffs associated with computation and communication time in different settings and how they depend on problem size and sparsity.
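For reference, the entropy-regularized OT problem described above can be written as min_{P >= 0, P 1 = r, P^T 1 = c} <C, P> + eps * sum_{ij} P_ij (log P_ij - 1), and the Sinkhorn iteration alternates the scalings u <- r / (K v), v <- c / (K^T u) with K = exp(-C / eps). The following NumPy simulation is a minimal sketch of one synchronous variant, assuming for illustration that the source marginal and the corresponding rows of the cost matrix are partitioned across clients while the target marginal is known to all; the function and variable names are hypothetical and the paper's exact protocols may differ.

import numpy as np

def federated_sinkhorn(C_blocks, r_blocks, c, eps=0.1, iters=200):
    # Each client k holds the row block C_k of the cost matrix and the block r_k
    # of the source marginal; the target marginal c is assumed globally known.
    K_blocks = [np.exp(-C / eps) for C in C_blocks]            # local kernels
    u_blocks = [np.ones_like(r) for r in r_blocks]
    v = np.ones_like(c)
    for _ in range(iters):
        # local row update on each client: u_k = r_k / (K_k v)
        u_blocks = [r / (K @ v) for r, K in zip(r_blocks, K_blocks)]
        # communication round: all-reduce of the partial column sums K_k^T u_k
        col_sums = sum(K.T @ u for K, u in zip(K_blocks, u_blocks))
        v = c / col_sums                                       # column update
    # local transport-plan blocks P_k = diag(u_k) K_k diag(v)
    return [u[:, None] * K * v[None, :] for u, K in zip(u_blocks, K_blocks)]

# toy usage: two clients, marginals recovered approximately
rng = np.random.default_rng(0)
C = rng.random((6, 5))
r = np.full(6, 1 / 6); c = np.full(5, 1 / 5)
P = np.vstack(federated_sinkhorn([C[:3], C[3:]], [r[:3], r[3:]], c))
print(P.sum(axis=1), P.sum(axis=0))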
Cause
Kungurtsev, Vyacheslav, Christov-Moore, Leonardo, Sir, Gustav, Krutsky, Martin
Causal Learning has emerged as a major theme of AI in recent years, promising to use special techniques to reveal the true nature of cause and effect in a number of important domains. We consider the epistemology of learning and recognizing true cause and effect phenomena. Through thought exercises on the customary use of the word "cause", especially in scientific domains, we investigate what, in practice, constitutes a valid causal claim. We recognize that the word's uses across scientific domains are disparate in form but consistent in function within the scientific paradigm. We highlight fundamental distinctions between the practices available to the natural and social sciences, emphasize the importance of many systems of interest being open and irreducible, and identify the important notion of hermeneutic knowledge for social science inquiry. We posit that these distinct properties require that definitive causal claims can only come through an agglomeration of consistent evidence across multiple domains and levels of abstraction, such as empirical, physiological, biochemical, etc. We present Cognitive Science as an exemplary multi-disciplinary field providing omnipresent opportunity for such a Research Program, and highlight the main general modes of practice of scientific inquiry that can adequately merge, rather than place as incorrigibly conflictual, multi-domain multi-abstraction scientific practices and language games.
Towards Diverse Device Heterogeneous Federated Learning via Task Arithmetic Knowledge Integration
Morafah, Mahdi, Kungurtsev, Vyacheslav, Chang, Hojin, Chen, Chen, Lin, Bill
Federated Learning has emerged as a promising paradigm for collaborative machine learning while preserving user data privacy. Despite its potential, standard FL lacks support for diverse heterogeneous device prototypes, which vary significantly in model and dataset sizes -- from small IoT devices to large workstations. This limitation is only partially addressed by existing knowledge distillation techniques, which often fail to transfer knowledge effectively across a broad spectrum of device prototypes with varied capabilities. This failure primarily stems from two issues: the dilution of informative logits from more capable devices by those from less capable ones, and the use of a single set of integrated logits as the distillation target across all devices, which neglects their individual learning capacities and the unique contributions of each. To address these challenges, we introduce TAKFL, a novel KD-based framework that treats the knowledge transfer from each device prototype's ensemble as a separate task, independently distilling each to preserve its unique contributions and avoid dilution. TAKFL also incorporates a KD-based self-regularization technique to mitigate the issues related to the noisy and unsupervised ensemble distillation process. To integrate the separately distilled knowledge, we introduce an adaptive task arithmetic knowledge integration process, allowing each student model to customize the knowledge integration for optimal performance. Additionally, we present theoretical results demonstrating the effectiveness of task arithmetic in transferring knowledge across heterogeneous devices with varying capacities. Comprehensive evaluations of our method across both CV and NLP tasks demonstrate that TAKFL achieves SOTA results across a variety of datasets and settings, significantly outperforming existing KD-based methods. Code is released at https://github.com/MMorafah/TAKFL
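To make the integration step concrete, below is a minimal sketch of task-arithmetic merging of separately distilled knowledge, assuming the student and every distilled model share the same architecture; the function name and the per-source coefficients are illustrative rather than the paper's exact procedure.

import torch

def integrate_task_vectors(student_state, distilled_states, coeffs):
    # Each distilled model contributes a "task vector" (its weights minus the
    # student's starting weights); the student merges them with its own
    # customized scaling coefficients.
    merged = {}
    for name, w0 in student_state.items():
        task_vectors = [sd[name] - w0 for sd in distilled_states]
        merged[name] = w0 + sum(c * tv for c, tv in zip(coeffs, task_vectors))
    return merged

# toy usage with one-parameter state dicts
s = {"w": torch.zeros(2)}
d1 = {"w": torch.ones(2)}; d2 = {"w": 2 * torch.ones(2)}
print(integrate_task_vectors(s, [d1, d2], [0.5, 0.25]))   # {'w': tensor([1., 1.])}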
Empirical Bayes for Dynamic Bayesian Networks Using Generalized Variational Inference
Kungurtsev, Vyacheslav, Apaar, Khandelwal, Aarya, Rastogi, Parth Sandeep, Chatterjee, Bapi, Mareček, Jakub
Dynamic Bayesian Networks (DBNs) are a class of Probabilistic Graphical Models that enable the modeling of a Markovian dynamic process by defining the transition kernel through the DAG structure of the graph found to fit a dataset. There are a number of structure learners that enable one to find the structure of a DBN to fit data, each with its own set of particular advantages and disadvantages. The structure of a DBN itself presents transparent criteria for identifying causal relationships between variables. However, without large quantities of data, identifying a ground truth causal structure becomes unrealistic in practice. Instead, one can consider a procedure by which a set of graphs identifying structure are computed as approximate noisy solutions and subsequently amortized in a broader statistical procedure fitting a mixture of DBNs. Each component of the mixture presents an alternative hypothesis on the causal structure. From the mixture weights, one can also compute the Bayes Factors comparing the preponderance of evidence between different models. This presents a natural opportunity for the development of Empirical Bayesian methods.
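As a worked reading of the last two sentences (our notation, not necessarily the paper's): if the fitted mixture weight $w_k$ is taken as an approximate posterior probability of the component structure $G_k$ under prior probability $\pi_k$, then the Bayes Factor comparing two components is
\[
\mathrm{BF}_{12} \;=\; \frac{p(D \mid G_1)}{p(D \mid G_2)} \;\approx\; \frac{w_1/\pi_1}{w_2/\pi_2},
\]
which reduces to the simple ratio $w_1 / w_2$ under a uniform prior over the candidate structures.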
Learning Dynamic Bayesian Networks from Data: Foundations, First Principles and Numerical Comparisons
Kungurtsev, Vyacheslav, Rysavy, Petr, Idlahcen, Fadwa, Rytir, Pavel, Wodecki, Ales
In this paper, we present a guide to the foundations of learning Dynamic Bayesian Networks (DBNs) from data in the form of multiple samples of trajectories over some length of time. We present the formalism for a generic DBN as well as for a set of common types of DBNs for particular variable distributions. We present the analytical form of the models, with a comprehensive discussion of the interdependence between structure and weights in a DBN model and their implications for learning. Next, we give a broad overview of learning methods, describing and categorizing them based on the most important statistical features and on how they treat the interplay between learning structure and weights. We give the analytical form of the likelihood and Bayesian score functions, emphasizing the distinction from the static case. We discuss functions used in optimization to enforce structural requirements. We briefly discuss more complex extensions and representations. Finally, we present a set of comparisons in different settings for various distinct but representative algorithms across the variants.
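For concreteness, one standard form of the likelihood referred to above, written for a first-order Markov, time-homogeneous DBN with transition structure $G$ over $d$ variables and $N$ observed trajectories of length $T$ (our notation and simplifying assumptions; the paper treats more general variants), is
\[
\log L(G, \theta) \;=\; \sum_{n=1}^{N} \Big[ \log p_0\big(x^{(n)}_{0}\big) + \sum_{t=1}^{T} \sum_{i=1}^{d} \log p\big(x^{(n)}_{i,t} \,\big|\, \mathrm{pa}_G(i)\big(x^{(n)}_{t-1}, x^{(n)}_{t}\big), \theta_i\big) \Big],
\]
where $\mathrm{pa}_G(i)$ may select parents from the previous slice (inter-slice edges) and, if permitted, the current slice (intra-slice edges); the distinction from the static case lies in this factorization across time slices.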
Group Distributionally Robust Dataset Distillation with Risk Minimization
Vahidian, Saeed, Wang, Mingyu, Gu, Jianyang, Kungurtsev, Vyacheslav, Jiang, Wei, Chen, Yiran
Dataset distillation (DD) has emerged as a widely adopted technique for crafting a synthetic dataset that captures the essential information of a training dataset, facilitating the training of accurate neural models. Its applications span various domains, including transfer learning, federated learning, and neural architecture search. The most popular methods for constructing the synthetic data rely on matching the convergence properties of training the model with the synthetic dataset and with the training dataset. However, targeting the training dataset must be thought of as auxiliary, in the same sense that the training set is itself an approximate substitute for the population distribution, and it is the latter that is the data of interest. Yet despite its popularity, an aspect of DD that remains unexplored is its relationship to generalization, particularly across uncommon subgroups. That is, how can we ensure that a model trained on the synthetic dataset performs well when faced with samples from regions with low population density? Here, the representativeness and coverage of the dataset become more salient than the guaranteed training error at inference. Drawing inspiration from distributionally robust optimization, we introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD. We provide a theoretical rationale for our approach and demonstrate its effective generalization and robustness across subgroups through numerical experiments.
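As one illustration of pairing clustering with a risk measure on the loss, the sketch below evaluates a conditional value-at-risk (CVaR) style objective over per-cluster losses; the particular risk measure, names, and threshold are assumptions for illustration rather than the paper's exact formulation.

import numpy as np

def cvar_over_clusters(cluster_losses, alpha=0.2):
    # Average of the worst alpha-fraction of per-cluster losses; minimizing this
    # with respect to the synthetic data emphasizes poorly covered subgroups.
    losses = np.sort(np.asarray(cluster_losses))[::-1]   # descending
    k = max(1, int(np.ceil(alpha * len(losses))))
    return losses[:k].mean()

# toy usage: losses of a synthetic-set-trained model on five data clusters
print(cvar_over_clusters([0.3, 0.8, 0.2, 1.1, 0.4], alpha=0.4))   # mean of 1.1 and 0.8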
Unlocking the Potential of Federated Learning: The Symphony of Dataset Distillation via Deep Generative Latents
Jia, Yuqi, Vahidian, Saeed, Sun, Jingwei, Zhang, Jianyi, Kungurtsev, Vyacheslav, Gong, Neil Zhenqiang, Chen, Yiran
Data heterogeneity presents significant challenges for federated learning (FL). Recently, dataset distillation techniques have been introduced, performed at the client level, to attempt to mitigate some of these challenges. In this paper, we propose a highly efficient FL dataset distillation framework on the server side, significantly reducing both the computational and communication demands on local devices while enhancing the clients' privacy. Unlike previous strategies that perform dataset distillation on local devices and upload synthetic data to the server, our technique enables the server to leverage prior knowledge from pre-trained deep generative models to synthesize essential data representations from a heterogeneous model architecture. This process allows local devices to train smaller surrogate models while enabling the training of a larger global model on the server, effectively minimizing resource utilization. We substantiate our claim with a theoretical analysis, demonstrating the asymptotic resemblance of the process to the hypothetical ideal of completely centralized training on a heterogeneous dataset. Empirical evidence from our comprehensive experiments indicates our method's superiority, delivering an accuracy enhancement of up to 40% over non-dataset-distillation techniques in highly heterogeneous FL contexts, and surpassing existing dataset-distillation methods by 18%. In addition to the high accuracy, our framework converges faster than the baselines because the server trains on a single multi-modal distribution rather than on several sets of heterogeneous data distributions. Our code is available at https://github.com/FedDG23/FedDG-main.git
Quantum Solutions to the Privacy vs. Utility Tradeoff
Chatterjee, Sagnik, Kungurtsev, Vyacheslav
In this work, we propose a novel architecture (and several variants thereof) based on quantum cryptographic primitives with provable privacy and security guarantees regarding membership inference attacks on generative models. Our architecture can be used on top of any existing classical or quantum generative models. We argue that the use of quantum gates associated with unitary operators provides inherent advantages compared to standard Differential Privacy based techniques for establishing guaranteed security from all polynomial-time adversaries.
A Stochastic-Gradient-based Interior-Point Algorithm for Solving Smooth Bound-Constrained Optimization Problems
Curtis, Frank E., Kungurtsev, Vyacheslav, Robinson, Daniel P., Wang, Qi
The interior-point methodology is one of the most effective approaches for solving continuous constrained optimization problems. In the context of (deterministic) derivative-based algorithmic strategies, interior-point methods offer convergence guarantees from remote starting points [11, 21, 27], and in both convex and nonconvex settings such algorithms can offer good worst-case iteration complexity properties [7, 21]. Furthermore, many of the most popular software packages for solving large-scale continuous optimization problems are based on interior-point methods [1, 11, 24, 25, 26, 27], and these have been used to great effect for many years. Despite the extensive literature on the theoretical and practical benefits of interior-point methods in the context of (deterministic) derivative-based algorithms for solving (non)convex optimization problems, to the best of our knowledge no interior-point method has yet been shown rigorously to offer convergence guarantees when neither function nor derivative evaluations are available and instead only stochastic gradient estimates are employed.
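For intuition only, the sketch below takes one interior-point-style step for a bound-constrained problem min f(x) subject to l <= x <= u, combining a stochastic gradient estimate of f with the gradient of the log-barrier term and a fraction-to-the-boundary safeguard; this is a generic illustration under simplifying assumptions, not the algorithm developed and analyzed in the paper.

import numpy as np

def stochastic_barrier_step(x, g, l, u, mu, alpha, tau=0.995):
    # One step on the barrier subproblem f(x) - mu * sum(log(x - l) + log(u - x)),
    # where g is a stochastic estimate of the gradient of f at x.
    grad_barrier = -mu / (x - l) + mu / (u - x)
    d = -(g + grad_barrier)
    # fraction-to-the-boundary rule: keep the iterate strictly interior
    step = alpha
    neg, pos = d < 0, d > 0
    if neg.any():
        step = min(step, (tau * (x - l)[neg] / -d[neg]).min())
    if pos.any():
        step = min(step, (tau * (u - x)[pos] / d[pos]).min())
    return x + step * d

# toy usage on the box [0, 1]^2
print(stochastic_barrier_step(np.array([0.5, 0.9]), np.array([1.0, -2.0]),
                              np.zeros(2), np.ones(2), mu=0.1, alpha=0.1))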
Riemannian Stochastic Approximation for Minimizing Tame Nonsmooth Objective Functions
Aspman, Johannes, Kungurtsev, Vyacheslav, Seraji, Reza Roohi
In many learning applications, the parameters in a model are structurally constrained in a way that can be modeled as them lying on a Riemannian manifold. Riemannian optimization, wherein procedures enforce that an iterative minimizing sequence remains constrained to the manifold, is used to train such models. At the same time, tame geometry has become a significant topological description of the nonsmooth functions that appear in the landscapes of training neural networks and other important models with structural compositions of continuous nonlinear functions and nonsmooth maps. In this paper, we study the properties of such stratifiable functions on a manifold and the behavior of retracted stochastic gradient descent, with diminishing stepsizes, for minimizing such functions.
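For a concrete picture of retracted stochastic (sub)gradient descent with diminishing stepsizes, the sketch below runs the method on the unit sphere, using normalization as the retraction and a noisy subgradient of the nonsmooth function f(x) = max_i |x_i|; the manifold, objective, and names here are illustrative assumptions rather than the paper's general setting.

import numpy as np

rng = np.random.default_rng(1)

def noisy_subgrad(x):
    # noisy subgradient estimate of f(x) = max_i |x_i|
    j = np.argmax(np.abs(x))
    g = np.zeros_like(x)
    g[j] = np.sign(x[j])
    return g + 0.01 * rng.standard_normal(x.shape)

def retracted_sgd_sphere(grad_fn, x0, steps=1000, a0=0.5):
    # retracted SGD on the unit sphere with diminishing stepsizes a0 / (k + 1)
    x = x0 / np.linalg.norm(x0)
    for k in range(steps):
        g = grad_fn(x)
        rg = g - (g @ x) * x                  # project onto the tangent space at x
        y = x - (a0 / (k + 1)) * rg           # step along the tangent direction
        x = y / np.linalg.norm(y)             # retraction: renormalize to the sphere
    return x

print(retracted_sgd_sphere(noisy_subgrad, rng.standard_normal(5)))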