
 celeba


Debiased Counterfactual Generation via Flow Matching from Observations

arXiv.org Machine Learning

Estimating counterfactual distributions under interventions is central to treatment risk assessment and counterfactual generation tasks. Existing approaches model the counterfactual distribution as a standalone generative target, without exploiting its relationship to the observational data. In this work, we show that under standard assumptions, observational and counterfactual outcome distributions are tightly linked: they have identical support and tail behavior, remain statistically close under weak confounding, and share any features of high-dimensional outcomes that are invariant to confounders. These properties motivate learning counterfactual distributions not from scratch, but via a deconfounding flow from the observational distribution. We formulate this problem via flow matching and derive a semiparametrically efficient estimator based on a novel efficient influence function correction. We subsequently extend our estimator to target minimal-energy flows in high dimensions, which we show can be especially simple targets between observational and counterfactual distributions. In experiments, deconfounding flows outperform existing debiased counterfactual distribution estimators, while also mitigating known failure modes of flow-based methods.
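For readers unfamiliar with flow matching, the following is a minimal sketch of the generic regression target it uses, assuming a straight-line path between a source sample and a target sample. It is not the debiased estimator described in the abstract: the function names, the linear path, and the toy velocity model are all illustrative assumptions.

```python
# Generic flow-matching target on a linear path (illustrative assumption,
# not the paper's efficient-influence-function-corrected estimator).

def interpolate(x0, x1, t):
    """Point on the straight line from x0 (source) to x1 (target) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Squared error between a model velocity and the path velocity at x_t."""
    xt = interpolate(x0, x1, t)
    v = velocity_fn(xt, t)
    u = target_velocity(x0, x1)
    return sum((vi - ui) ** 2 for vi, ui in zip(v, u))

# Toy pair: a hypothetical "observational" sample and "counterfactual" sample.
x0 = [0.0, 0.0]
x1 = [2.0, -4.0]
xt = interpolate(x0, x1, 0.25)

# A hypothetical velocity model that happens to match the target exactly.
perfect_model = lambda x, t: [2.0, -4.0]
loss = flow_matching_loss(perfect_model, x0, x1, 0.25)
```

In practice the velocity model would be a neural network and the loss would be averaged over many sampled pairs and times; the sketch only shows what a single term of that average looks like.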


CODA: A Correlation-Oriented Disentanglement and Augmentation Modeling Scheme for Better Resisting Subpopulation Shifts

Neural Information Processing Systems

Learned data-driven models often struggle to generalize due to widespread subpopulation shifts, especially the presence of both spurious correlations and group imbalance (SC-GI). To learn models that better withstand SC-GI, we propose a Correlation-Oriented Disentanglement and Augmentation (CODA) modeling scheme, which includes two unique developments: (1) correlation-oriented disentanglement and (2) strategic sample augmentation with reweighted consistency (RWC) loss. In (1), a bi-branch encoding process is developed to enable the disentangling of variant and invariant correlations by coordinating with a decoy classifier and the decoder reconstruction. In (2), a strategic sample augmentation based on disentangled latent features with RWC loss is designed to reinforce the training of a more generalizable model. The effectiveness of CODA is verified by benchmarking against a set of SOTA models in terms of worst-group accuracy and maximum group accuracy gap on two well-known datasets, ColoredMNIST and CelebA.
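The two evaluation metrics named above can be sketched as follows. This is a minimal illustration under an assumed definition: the maximum group accuracy gap is taken here to be the difference between the best and worst per-group accuracies, which the abstract does not spell out. The function names and toy data are hypothetical.

```python
from collections import defaultdict

def group_accuracies(y_true, y_pred, groups):
    """Accuracy computed separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

def worst_group_accuracy(acc):
    """Lowest accuracy over all groups."""
    return min(acc.values())

def max_group_accuracy_gap(acc):
    """Assumed definition: best-group accuracy minus worst-group accuracy."""
    return max(acc.values()) - min(acc.values())

# Toy predictions over two hypothetical groups "a" and "b".
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b"]

acc = group_accuracies(y_true, y_pred, groups)
wga = worst_group_accuracy(acc)
gap = max_group_accuracy_gap(acc)
```

On this toy data group "a" has three of four predictions correct and group "b" one of two, so the worst-group accuracy is that of group "b".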




Wasserstein Iterative Networks for Barycenter Estimation

Neural Information Processing Systems

Wasserstein barycenters have become popular due to their ability to represent the average of probability measures in a geometrically meaningful way. In this paper, we present an algorithm to approximate the Wasserstein-2 barycenters of continuous measures via a generative model. Previous approaches rely on regularization (entropic/quadratic) which introduces bias or on input convex neural networks which are not expressive enough for large-scale tasks. In contrast, our algorithm does not introduce bias and allows using arbitrary neural networks. In addition, based on the celebrity faces dataset, we construct Ave, celeba!
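As a concrete point of reference, the Wasserstein-2 barycenter has a simple form in one dimension: its quantile function is the weighted average of the input quantile functions. The sketch below applies this to equal-sized empirical samples by averaging sorted values. It only illustrates the notion of a barycenter and is not the generative algorithm for continuous high-dimensional measures described in the abstract; the function name is hypothetical.

```python
def barycenter_1d(samples, weights):
    """Empirical 1D Wasserstein-2 barycenter via quantile averaging.

    samples: list of equal-length lists of floats, one list per measure.
    weights: non-negative weights summing to 1, one per measure.
    Returns the sorted support points of the barycenter.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same length")
    sorted_samples = [sorted(s) for s in samples]
    # The i-th order statistic of the barycenter is the weighted mean
    # of the i-th order statistics of the inputs.
    return [
        sum(w * s[i] for w, s in zip(weights, sorted_samples))
        for i in range(n)
    ]

# Two toy empirical measures with equal weight.
bary = barycenter_1d([[2.0, 0.0, 1.0], [10.0, 14.0, 12.0]], [0.5, 0.5])
```

The same shortcut does not exist in higher dimensions, which is why approximation schemes such as the one in this paper are needed there.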


Appendix A Proofs

Neural Information Processing Systems

The proof follows from the following equality and the fact that $Z_\gamma$ is independent of $q(z)$. All experiments are run on NVIDIA GPUs. The exact software versions can be found in the supplemental code. The 'letter' split of the EMNIST dataset was used as the auxiliary dataset. The images are resized to 32x32.


The proposition makes use of the following observation: for the discriminator defined in (1), the norm of the gradient for $w_t$ is upper bounded by $\|\nabla_{w_t} D_\theta(x)\|_F \le \|x\| \prod \cdots$

Neural Information Processing Systems

The upper bound on the Frobenius norm of the gradient for spectrally normalized discriminators follows directly. As $l_w(x)$ is a linear transformation, we have $l_{cw}(x) = c\, l_w(x)$ and $l_w(cx) = c\, l_w(x)$. Moreover, since ReLU and leaky ReLU are linear on each of the regions $\mathbb{R}^+$ and $\mathbb{R}^-$, we have $a_i(cx) = c\, a_i(x)$. In this section we discuss the gradients with respect to the actual parameter $w_i$. From Eq. (12) in [30] we know $\nabla_{w_t} D_\theta(x) = A$, and we know that $\|\nabla_{w'_t} D_\theta(x)\|_F$, $\|\nabla_{o_l^t(x)} D_\theta(x)\|$, and $\|o_l^t(x)\|$ have upper bounds. From Theorem 1.1 in [44] we know that if $w_t$ is initialized with i.i.d. random variables from a uniform or Gaussian distribution, $\mathbb{E}\,\|w_t\|_{sp}$ is lower bounded away from zero at initialization. So $\|\nabla_{w_t} D_\theta(x)\|_F$ is upper bounded at initialization. Moreover, we observe empirically that $\|w_t\|_{sp}$ is usually increasing during training. Therefore, $\|\nabla_{w_t} D_\theta(x)\|_F$ is typically upper bounded during training as well. The following proposition states that spectral normalization also gives an upper bound on $\|H_{w_i}(D_\theta)(x)\|_{sp}$ for networks with ReLU or leaky ReLU internal activations.
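The scaling identities used in this argument can be checked numerically. The sketch below is an illustration with hypothetical helper names, assuming a bias-free linear layer and a positive scaling constant c; ReLU and leaky ReLU are positively homogeneous, so the activation identity is only checked for c > 0.

```python
def linear(w, x):
    """Bias-free linear map l_w(x) = <w, x> for a single output unit."""
    return sum(wi * xi for wi, xi in zip(w, x))

def leaky_relu(z, slope=0.25):
    """Leaky ReLU; slope=0.0 recovers the ordinary ReLU."""
    return z if z >= 0 else slope * z

c = 2.0                      # positive scaling constant (assumption: c > 0)
w = [1.0, -2.0, 0.5]
x = [3.0, 1.0, -4.0]

pre = linear(w, x)                                   # l_w(x)
scaled_weights = linear([c * wi for wi in w], x)     # l_{cw}(x)
scaled_input = linear(w, [c * xi for xi in x])       # l_w(cx)

# Activation applied to a scaled pre-activation, for both signs of the input.
act_neg = leaky_relu(c * pre)         # pre is negative here
act_pos = leaky_relu(c * 3.0)
```

Scaling either the weights or the input by c scales the pre-activation by c, and the activation passes that factor through unchanged on both the negative and the positive side.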


2e9f978b222a956ba6bdf427efbd9ab3-Supplemental.pdf

Neural Information Processing Systems

B.3 Derivations of Eq. (19)

Similar to the derivation above, we give the gradient with respect to the weight vector $w \in \mathbb{R}^M_+$, which is given by

$$\nabla_w D_{KL} = -\nabla_w \log Z(U, w) - \nabla_w \mathbb{E}_{U,w}\big(\log p_\theta(X \mid z)\big)^T 1_N + \nabla_w \mathbb{E}_{U,w}\big(\log p_\theta(U \mid z)\big)^T w.$$

The learning rate of each stochastic gradient descent step is $\gamma_t \propto t^{-1}$, where $t \in \{1, \dots, T\}$ denotes the iteration for optimization.

We already report the t-SNE visualization of ByPE-VAE and standard VAE in Figure. Here we give more t-SNE visualization results. First, we randomly sample from ByPE-VAEs trained on different datasets, namely MNIST, Fashion MNIST, and CelebA, as shown in Fig. 7.