Gretton, Arthur
(De)-regularized Maximum Mean Discrepancy Gradient Flow
Chen, Zonghao, Mustafi, Aratrika, Glaser, Pierre, Korba, Anna, Gretton, Arthur, Sriperumbudur, Bharath K.
We introduce a (de)-regularization of the Maximum Mean Discrepancy (DrMMD) and its Wasserstein gradient flow. Existing gradient flows that transport samples from a source distribution to a target distribution using only target samples either lack a tractable numerical implementation ($f$-divergence flows) or require strong assumptions, and modifications such as noise injection, to ensure convergence (Maximum Mean Discrepancy flows). In contrast, the DrMMD flow can simultaneously (i) guarantee near-global convergence for a broad class of targets in both continuous and discrete time, and (ii) be implemented in closed form using only samples. The former is achieved by leveraging the connection between the DrMMD and the $\chi^2$-divergence, while the latter follows from treating the DrMMD as an MMD with a de-regularized kernel. Our numerical scheme uses an adaptive de-regularization schedule throughout the flow to optimally trade off between discretization errors and deviations from the $\chi^2$ regime. The potential of the DrMMD flow is demonstrated across several numerical experiments, including a large-scale setting of training student/teacher networks.
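As a rough illustration of the particle dynamics behind MMD-type flows, the following minimal numpy sketch moves particles along the negative gradient of the MMD witness function. It uses a plain Gaussian kernel standing in for the de-regularized kernel of DrMMD, and all parameter choices (bandwidth, step size) are illustrative, not the paper's scheme:

```python
import numpy as np

def witness_grad(X, particles, targets, sigma=1.0):
    """Gradient of the MMD witness function, evaluated at each row of X.

    The witness is f(x) = mean_i k(x, p_i) - mean_j k(x, t_j) with a
    Gaussian kernel k; the flow moves each particle as dx/dt = -grad f(x).
    """
    def term(A):
        D = X[:, None, :] - A[None, :, :]                    # (n, m, d) differences
        K = np.exp(-np.sum(D**2, axis=-1) / (2 * sigma**2))  # kernel values
        return np.mean(-D / sigma**2 * K[..., None], axis=1) # mean kernel gradient
    return term(particles) - term(targets)

rng = np.random.default_rng(0)
particles = rng.normal(2.0, 1.0, size=(50, 1))   # source: N(2, 1)
targets = rng.normal(0.0, 1.0, size=(50, 1))     # target: N(0, 1)
for _ in range(200):                             # explicit Euler discretization
    particles = particles - 0.5 * witness_grad(particles, particles, targets)
```

After the loop the particle cloud has drifted toward the target samples; the de-regularization schedule of the paper would additionally adapt the kernel along the flow.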
Spectral Representation for Causal Estimation with Hidden Confounders
Ren, Tongzheng, Sun, Haotian, Moulin, Antoine, Gretton, Arthur, Dai, Bo
We address the problem of causal effect estimation where hidden confounders are present, with a focus on two settings: instrumental variable regression with additional observed confounders, and proxy causal learning. Our approach uses a singular value decomposition of a conditional expectation operator, followed by a saddle-point optimization problem, which, in the context of IV regression, can be thought of as a neural net generalization of the seminal approach due to Darolles et al. [2011]. Saddle-point formulations have attracted considerable attention recently, as they can avoid double sampling bias and are amenable to modern function approximation methods. We provide experimental validation in various settings, and show that our approach outperforms existing methods on common benchmarks.
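The core object, an estimated conditional expectation operator and its SVD, can be sketched in a toy finite-dimensional setting. Here identity feature maps replace the paper's neural representations, and the ridge regularizer `lam` is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 2000, 4, 3
W = rng.normal(size=(dy, dx))                  # ground-truth linear map
X = rng.normal(size=(n, dx))
Y = X @ W.T + 0.1 * rng.normal(size=(n, dy))   # Y = W X + noise

# Ridge estimate of the conditional expectation operator B, i.e. the
# minimizer of sum_i ||psi(y_i) - B phi(x_i)||^2 + lam ||B||^2,
# with phi and psi taken to be identity feature maps.
lam = 1e-3
B = Y.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(dx))

# Spectral representation: singular value decomposition of the operator.
U, s, Vt = np.linalg.svd(B)
```

With flexible (e.g. neural) feature maps in place of the identity, an SVD of the resulting operator plays the role described in the abstract.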
Mind the Graph When Balancing Data for Fairness or Robustness
Schrouff, Jessica, Bellot, Alexis, Rannen-Triki, Amal, Malek, Alan, Albuquerque, Isabela, Gretton, Arthur, D'Amour, Alexander, Chiappa, Silvia
Failures of fairness or robustness in machine learning predictive settings can be due to undesired dependencies between covariates, outcomes and auxiliary factors of variation. A common strategy to mitigate these failures is data balancing, which attempts to remove those undesired dependencies. In this work, we define conditions on the training distribution for data balancing to lead to fair or robust models. Our results show that, in many cases, the balanced distribution does not correspond to selectively removing the undesired dependencies in a causal graph of the task, leading to multiple failure modes and even interference with other mitigation techniques such as regularization. Overall, our results highlight the importance of taking the causal graph into account before performing data balancing.
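To make "data balancing" concrete: a common discrete form reweights each sample by the ratio of marginals to joint, so the reweighted label and auxiliary factor become independent. This sketch is one standard balancing recipe, not the paper's analysis of when such balancing succeeds or fails:

```python
import numpy as np

def balancing_weights(y, a):
    """Per-sample weights w(y, a) = p(y) p(a) / p(y, a).

    Under these weights, the empirical joint of (y, a) factorizes,
    removing the dependence between label and auxiliary factor.
    """
    y = np.asarray(y)
    a = np.asarray(a)
    n = len(y)
    _, yi = np.unique(y, return_inverse=True)
    _, ai = np.unique(a, return_inverse=True)
    joint = np.zeros((yi.max() + 1, ai.max() + 1))
    np.add.at(joint, (yi, ai), 1.0 / n)          # empirical joint p(y, a)
    py = joint.sum(axis=1)                       # marginal p(y)
    pa = joint.sum(axis=0)                       # marginal p(a)
    return py[yi] * pa[ai] / joint[yi, ai]

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=1000)
y = (rng.random(1000) < 0.3 + 0.4 * a).astype(int)   # y depends on a
w = balancing_weights(y, a)

# Check: the weighted joint equals the product of its marginals.
jw = np.zeros((2, 2))
np.add.at(jw, (y, a), w / w.sum())
assert np.allclose(jw, np.outer(jw.sum(axis=1), jw.sum(axis=0)), atol=1e-9)
```

The paper's point is that removing this observed dependence need not remove the corresponding dependence in the causal graph of the task.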
Conditional Bayesian Quadrature
Chen, Zonghao, Naslidnyk, Masha, Gretton, Arthur, Briol, François-Xavier
We propose a novel approach for estimating conditional or parametric expectations in settings where obtaining samples or evaluating integrands is costly. Through the framework of probabilistic numerical methods (such as Bayesian quadrature), our approach incorporates prior information about the integrands, in particular prior knowledge of the smoothness of the integrand and of the conditional expectation. As a result, our approach provides a way of quantifying uncertainty and achieves a fast convergence rate, which is confirmed both theoretically and empirically on challenging tasks in Bayesian sensitivity analysis, computational finance and decision making under uncertainty.
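For context, standard (unconditional) Bayesian quadrature computes the posterior mean of an integral as a weighted sum of integrand evaluations, with weights derived from the kernel mean. A minimal 1-D sketch under a Gaussian-kernel GP prior and a standard normal integration measure (node placement, bandwidth and jitter are illustrative choices):

```python
import numpy as np

def bq_weights(nodes, sigma=1.0, jitter=1e-8):
    """Bayesian-quadrature weights for a Gaussian-kernel GP prior and
    integration measure pi = N(0, 1) in one dimension.

    The posterior mean of I(f) = int f dpi is w @ f(nodes), where
    w = K^{-1} z and z_i = int k(x, x_i) dpi(x) has the closed form below.
    """
    K = np.exp(-(nodes[:, None] - nodes[None, :]) ** 2 / (2 * sigma**2))
    z = sigma / np.sqrt(sigma**2 + 1) * np.exp(-nodes**2 / (2 * (sigma**2 + 1)))
    return np.linalg.solve(K + jitter * np.eye(len(nodes)), z)

nodes = np.linspace(-4.0, 4.0, 12)
w = bq_weights(nodes)
estimate = w @ nodes**2     # estimate of E[X^2] under N(0, 1); true value is 1
```

The conditional setting of the paper extends this by modelling how such expectations vary with a conditioning parameter.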
Optimal Rates for Vector-Valued Spectral Regularization Learning Algorithms
Meunier, Dimitri, Shen, Zikai, Mollenhauer, Mattes, Gretton, Arthur, Li, Zhu
We study theoretical properties of a broad class of regularized algorithms with vector-valued output. These spectral algorithms include kernel ridge regression, kernel principal component regression, various implementations of gradient descent and many more. Our contributions are twofold. First, we rigorously confirm the so-called saturation effect for ridge regression with vector-valued output by deriving a novel lower bound on learning rates; this bound shows ridge regression to be suboptimal when the smoothness of the regression function exceeds a certain level. Second, we present an upper bound on the finite-sample risk of general vector-valued spectral algorithms, applicable to both well-specified and misspecified scenarios (where the true regression function lies outside the hypothesis space), which is minimax optimal in various regimes. All of our results explicitly allow for infinite-dimensional output variables, proving consistency of recent practical applications.
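The simplest member of this class is kernel ridge regression with vector-valued output. With the separable matrix-valued kernel K(x, x') = k(x, x')·I, the solution reduces to solving one scalar ridge system applied columnwise, as in this sketch (all data and parameters are synthetic illustrations):

```python
import numpy as np

def fit_krr(X, Y, lam=1e-4, sigma=1.0):
    """Kernel ridge regression with vector-valued output Y (n, d_out).

    For the separable kernel K(x, x') = k(x, x') * I, the coefficients
    are alpha = (K + n * lam * I)^{-1} Y, applied columnwise.
    """
    n = len(X)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * sigma**2))
    alpha = np.linalg.solve(K + n * lam * np.eye(n), Y)
    def predict(Xnew):
        Kx = np.exp(-(Xnew[:, None] - X[None, :]) ** 2 / (2 * sigma**2))
        return Kx @ alpha
    return predict

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=200)
Y = np.stack([np.sin(X), np.cos(X)], axis=1) + 0.05 * rng.normal(size=(200, 2))
predict = fit_krr(X, Y)
Xt = np.linspace(-2.5, 2.5, 50)
err = np.max(np.abs(predict(Xt) - np.stack([np.sin(Xt), np.cos(Xt)], axis=1)))
```

Other spectral algorithms in the paper's class replace the function (K + nλI)^{-1} of the kernel matrix by other spectral filters (e.g. truncated inverses or gradient-descent iterates).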
Deep MMD Gradient Flow without adversarial training
Galashov, Alexandre, de Bortoli, Valentin, Gretton, Arthur
We propose a gradient flow procedure for generative modeling by transporting particles from an initial source distribution to a target distribution, where the gradient field on the particles is given by a noise-adaptive Wasserstein Gradient of the Maximum Mean Discrepancy (MMD). The noise-adaptive MMD is trained on data distributions corrupted by increasing levels of noise, obtained via a forward diffusion process, as commonly used in denoising diffusion probabilistic models. The result is a generalization of MMD Gradient Flow, which we call Diffusion-MMD-Gradient Flow or DMMD. The divergence training procedure is related to discriminator training in Generative Adversarial Networks (GAN), but does not require adversarial training. We obtain competitive empirical performance in unconditional image generation on CIFAR10, MNIST and CELEB-A (64 x 64).
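The forward-diffusion corruption used to build the increasingly noisy training distributions can be sketched in its variance-preserving form (the parameterization below is the one common in denoising diffusion models, taken here as an assumption):

```python
import numpy as np

def corrupt(x0, alpha_bar, rng):
    """Forward-diffusion corruption at noise level alpha_bar:
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I).
    alpha_bar = 1 leaves the data clean; alpha_bar = 0 is pure noise."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(10_000,))        # stand-in for unit-variance data
levels = [1.0, 0.7, 0.3, 0.0]          # decreasing alpha_bar = more noise
noisy = {ab: corrupt(x0, ab, rng) for ab in levels}
```

A noise-conditional MMD discriminator trained across these levels then supplies the gradient field that transports particles from noise back to data.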
Proxy Methods for Domain Adaptation
Tsai, Katherine, Pfohl, Stephen R., Salaudeen, Olawale, Chiou, Nicole, Kusner, Matt J., D'Amour, Alexander, Koyejo, Sanmi, Gretton, Arthur
We study the problem of domain adaptation under distribution shift, where the shift is due to a change in the distribution of an unobserved, latent variable that confounds both the covariates and the labels. In this setting, neither the covariate shift nor the label shift assumptions apply. Our approach to adaptation employs proximal causal learning, a technique for estimating causal effects in settings where proxies of unobserved confounders are available. We demonstrate that proxy variables allow for adaptation to distribution shift without explicitly recovering or modeling latent variables. We consider two settings, (i) Concept Bottleneck: an additional ''concept'' variable is observed that mediates the relationship between the covariates and labels; (ii) Multi-domain: training data from multiple source domains is available, where each source domain exhibits a different distribution over the latent confounder. We develop a two-stage kernel estimation approach to adapt to complex distribution shifts in both settings. In our experiments, we show that our approach outperforms other methods, notably those which explicitly recover the latent confounder.
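The paper's two-stage kernel estimator is beyond a short sketch, but the two-stage logic it builds on can be illustrated in the linear instrumental-variable case, where stage 1 regresses the confounded variable on the proxy/instrument and stage 2 regresses the outcome on the stage-1 fit (all variables and coefficients below are synthetic illustrations, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
Z = rng.normal(size=n)                  # instrument / proxy-style variable
U = rng.normal(size=n)                  # hidden confounder
X = Z + U + 0.1 * rng.normal(size=n)    # covariate, confounded by U
Y = 2.0 * X + 2.0 * U + 0.1 * rng.normal(size=n)   # true effect of X is 2

# Stage 1: regress X on Z.  Stage 2: regress Y on the stage-1 fit.
beta1 = (Z @ X) / (Z @ Z)
X_hat = beta1 * Z
beta2 = (X_hat @ Y) / (X_hat @ X_hat)   # approximately recovers 2

beta_naive = (X @ Y) / (X @ X)          # direct regression, biased by U
```

Replacing the two linear regressions with kernel ridge regressions gives a nonparametric two-stage procedure of the kind the paper develops for its adaptation settings.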
Practical Kernel Tests of Conditional Independence
Pogodin, Roman, Schrab, Antonin, Li, Yazhe, Sutherland, Danica J., Gretton, Arthur
We describe a data-efficient, kernel-based approach to statistical testing of conditional independence. A major challenge of conditional independence testing, absent in tests of unconditional independence, is to obtain the correct test level (the specified upper bound on the rate of false positives), while still attaining competitive test power. Excess false positives arise due to bias in the test statistic, which is obtained using nonparametric kernel ridge regression. We propose three methods for bias control to correct the test level, based on data splitting, auxiliary data, and (where possible) simpler function classes. We show these combined strategies are effective both for synthetic and real-world data.
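One of the bias-control ideas, data splitting, can be illustrated with a residual-covariance style conditional-independence statistic: the kernel ridge regressions are fit on one half of the data and the statistic is evaluated on the other half, so regression bias does not contaminate the null distribution. This is a generic sketch in that spirit, not the paper's test (functions, bandwidths and regularizers are illustrative):

```python
import numpy as np

def krr_fit(Z, T, lam=1e-3, sigma=1.0):
    """Gaussian-kernel ridge regression from Z to T; returns a predictor."""
    K = np.exp(-(Z[:, None] - Z[None, :]) ** 2 / (2 * sigma**2))
    alpha = np.linalg.solve(K + len(Z) * lam * np.eye(len(Z)), T)
    return lambda Zn: np.exp(-(Zn[:, None] - Z[None, :]) ** 2 / (2 * sigma**2)) @ alpha

def split_ci_statistic(X, Y, Z):
    """Normalized residual-covariance statistic with data splitting:
    regressions fit on the first half, statistic computed on the second."""
    n = len(Z) // 2
    fx = krr_fit(Z[:n], X[:n])
    fy = krr_fit(Z[:n], Y[:n])
    rx = X[n:] - fx(Z[n:])                  # held-out residuals of X given Z
    ry = Y[n:] - fy(Z[n:])                  # held-out residuals of Y given Z
    prod = rx * ry
    return np.sqrt(len(prod)) * prod.mean() / prod.std()

rng = np.random.default_rng(0)
Z = rng.uniform(-2, 2, size=2000)
X = np.sin(Z) + 0.3 * rng.normal(size=2000)
Y = np.cos(Z) + 0.3 * rng.normal(size=2000)     # X and Y independent given Z
T_null = split_ci_statistic(X, Y, Z)            # approximately N(0, 1) under H0

e = 0.3 * rng.normal(size=2000)                 # shared noise breaks X ⟂ Y | Z
T_dep = split_ci_statistic(np.sin(Z) + e, np.cos(Z) + e, Z)
```

Without the split, residuals computed on the same data used for fitting are biased toward zero dependence on the training error, which distorts the test level.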
A Distributional Analogue to the Successor Representation
Wiltzer, Harley, Farebrother, Jesse, Gretton, Arthur, Tang, Yunhao, Barreto, André, Dabney, Will, Bellemare, Marc G., Rowland, Mark
This paper contributes a new approach for distributional reinforcement learning which elucidates a clean separation of transition structure and reward in the learning process. Analogous to how the successor representation (SR) describes the expected consequences of behaving according to a given policy, our distributional successor measure (SM) describes the distributional consequences of this behaviour. We formulate the distributional SM as a distribution over distributions and provide theory connecting it with distributional and model-based reinforcement learning. Moreover, we propose an algorithm that learns the distributional SM from data by minimizing a two-level maximum mean discrepancy. Key to our method are a number of algorithmic techniques that are independently valuable for learning generative models of state. As an illustration of the usefulness of the distributional SM, we show that it enables zero-shot risk-sensitive policy evaluation in a way that was not previously possible.
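The two-level maximum mean discrepancy used for learning can be sketched directly: an inner MMD compares two sample sets, and an outer MMD compares two collections of sample sets through a kernel on distributions built from the inner MMD. Kernel choices and bandwidths below are illustrative:

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) squared MMD between two 1-D samples,
    with a Gaussian kernel."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def two_level_mmd2(A, B, h=1.0):
    """Squared MMD between two collections of sample sets, using the
    distribution-level kernel K(P, Q) = exp(-MMD^2(P, Q) / (2 h^2))."""
    def K(S, T):
        return np.mean([[np.exp(-mmd2(p, q) / (2 * h**2)) for q in T] for p in S])
    return K(A, A) + K(B, B) - 2 * K(A, B)

rng = np.random.default_rng(0)
A = [rng.normal(0, 1, 100) for _ in range(10)]   # ten sample sets from N(0, 1)
B = [rng.normal(0, 1, 100) for _ in range(10)]   # same law as A
C = [rng.normal(2, 1, 100) for _ in range(10)]   # shifted law
```

Minimizing such a two-level discrepancy against data is how the distributional SM, itself a distribution over distributions, can be fit from samples.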
Controlling Moments with Kernel Stein Discrepancies
Kanagawa, Heishiro, Barp, Alessandro, Gretton, Arthur, Mackey, Lester
Kernel Stein discrepancies (KSDs) measure the quality of a distributional approximation and can be computed even when the target density has an intractable normalizing constant. Notable applications include the diagnosis of approximate MCMC samplers and goodness-of-fit tests for unnormalized statistical models. The present work analyzes the convergence control properties of KSDs. We first show that standard KSDs used for weak convergence control fail to control moment convergence. To address this limitation, we next provide sufficient conditions under which alternative diffusion KSDs control both moment and weak convergence. As an immediate consequence we develop, for each $q > 0$, the first KSDs known to exactly characterize $q$-Wasserstein convergence.
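A standard KSD computation illustrates why no normalizing constant is needed: the Stein kernel depends on the target only through its score function. A 1-D sketch with the Langevin Stein operator and a Gaussian base kernel (sample sizes and bandwidth are illustrative):

```python
import numpy as np

def ksd2(x, score, sigma=1.0):
    """V-statistic estimate of the squared Langevin KSD in one dimension,
    with Gaussian base kernel k and target score function `score`.

    Stein kernel: u_p(x, y) = s(x) s(y) k + s(x) dk/dy + s(y) dk/dx
                              + d^2 k / dx dy.
    """
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * sigma**2))
    dkx = -d / sigma**2 * k                      # d/dx k(x, y)
    dky = d / sigma**2 * k                       # d/dy k(x, y)
    dkxy = (1 / sigma**2 - d**2 / sigma**4) * k  # d^2/dxdy k(x, y)
    s = score(x)
    h = s[:, None] * s[None, :] * k + s[:, None] * dky + s[None, :] * dkx + dkxy
    return h.mean()

score = lambda x: -x                  # score of the target N(0, 1): d/dx log p
rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, 500)      # samples from the target
bad = rng.normal(1.0, 1.0, 500)       # samples from a shifted distribution
```

The score -x here comes from an unnormalized log-density -x²/2; the paper's diffusion KSDs modify the Stein operator so that such discrepancies also control moment convergence.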