 Mueller, Peter


Bayesian Density-Density Regression with Application to Cell-Cell Communications

arXiv.org Machine Learning

We introduce a scalable framework for regressing multivariate distributions onto multivariate distributions, motivated by the application of inferring cell-cell communication from population-scale single-cell data. The observed data consist of pairs of multivariate distributions for ligands from one cell type and corresponding receptors from another. For each ordered pair $e=(l,r)$ of cell types $(l \neq r)$ and each sample $i = 1, \ldots, n$, we observe a pair of distributions $(F_{ei}, G_{ei})$ of gene expressions for ligands and receptors of cell types $l$ and $r$, respectively. The aim is to set up a regression of receptor distributions $G_{ei}$ given ligand distributions $F_{ei}$. A key challenge is that these distributions reside in distinct spaces of differing dimensions. We formulate the regression of multivariate densities on multivariate densities using a generalized Bayes framework with the sliced Wasserstein distance between fitted and observed distributions. Finally, we use inference under such regressions to define a directed graph for cell-cell communications.
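To make the discrepancy concrete, below is a minimal sketch of the sliced Wasserstein distance between two empirical multivariate distributions, the quantity entering the generalized Bayes loss between fitted and observed receptor distributions. This is illustrative only and not the paper's implementation; the function name, the Monte Carlo approximation with random projection directions, and the assumption of equal sample sizes are choices made here for brevity.

# Sketch: Monte Carlo estimate of the sliced Wasserstein-p distance between
# two empirical distributions given by samples X (n x d) and Y (n x d).
# Equal sample sizes are assumed so that sorted projections can be matched.
import numpy as np

def sliced_wasserstein(X, Y, n_proj=100, p=2, rng=None):
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # Random projection directions on the unit sphere in R^d.
    theta = rng.normal(size=(n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both samples onto each direction (n_proj x n).
    X_proj = theta @ X.T
    Y_proj = theta @ Y.T
    # One-dimensional Wasserstein-p between sorted projections,
    # averaged over directions.
    X_sorted = np.sort(X_proj, axis=1)
    Y_sorted = np.sort(Y_proj, axis=1)
    return (np.mean(np.abs(X_sorted - Y_sorted) ** p)) ** (1.0 / p)

# Example: two samples from shifted Gaussians in R^3.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
Y = rng.normal(loc=0.5, size=(500, 3))
print(sliced_wasserstein(X, Y, n_proj=200, rng=1))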


Summarizing Bayesian Nonparametric Mixture Posterior -- Sliced Optimal Transport Metrics for Gaussian Mixtures

arXiv.org Machine Learning

Existing methods to summarize posterior inference for mixture models focus on identifying a point estimate of the implied random partition for clustering, with density estimation as a secondary goal (Wade and Ghahramani, 2018; Dahl et al., 2022). We propose a novel approach for summarizing posterior inference in nonparametric Bayesian mixture models, prioritizing density estimation of the mixing measure (or mixture) as the inference target. A key feature is the model-agnostic nature of the approach, which remains valid under arbitrarily complex dependence structures in the underlying sampling model. Using a decision-theoretic framework, our method identifies a point estimate by minimizing posterior expected loss, with the loss function defined as a discrepancy between mixing measures. Estimating the mixing measure implies inference on the mixture density and the random partition. Exploiting the discrete nature of the mixing measure, we use a version of the sliced Wasserstein distance. We introduce two specific variants for Gaussian mixtures. The first, mixed sliced Wasserstein, applies generalized geodesic projections on the product of the Euclidean space and the manifold of symmetric positive definite matrices. The second, sliced mixture Wasserstein, leverages the linearity of Gaussian mixture measures for efficient projection.
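As an illustration of a sliced discrepancy between discrete mixing measures, the sketch below projects weighted atoms onto random directions and averages one-dimensional Wasserstein-1 distances between the induced weighted measures. It captures only the location part of the atoms and is not the paper's mixed sliced Wasserstein or sliced mixture Wasserstein metrics, which additionally handle the covariance (SPD) component of Gaussian atoms; the function name and the use of scipy.stats.wasserstein_distance are assumptions made for this sketch.

# Sketch: sliced Wasserstein-1 between two discrete mixing measures,
# each given by atoms (k x d array) with probability weights (length k).
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_w1_discrete(atoms1, w1, atoms2, w2, n_proj=100, rng=None):
    rng = np.random.default_rng(rng)
    d = atoms1.shape[1]
    # Random projection directions on the unit sphere in R^d.
    theta = rng.normal(size=(n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Average 1D Wasserstein-1 between the projected weighted atoms.
    dists = [
        wasserstein_distance(atoms1 @ t, atoms2 @ t, u_weights=w1, v_weights=w2)
        for t in theta
    ]
    return float(np.mean(dists))

# Example: two mixing measures with three and two atoms in R^2.
G1_atoms, G1_w = np.array([[0., 0.], [2., 1.], [4., -1.]]), np.array([.5, .3, .2])
G2_atoms, G2_w = np.array([[0.5, 0.], [3., 0.5]]), np.array([.6, .4])
print(sliced_w1_discrete(G1_atoms, G1_w, G2_atoms, G2_w, n_proj=200, rng=0))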


Consensus Monte Carlo for Random Subsets using Shared Anchors

arXiv.org Machine Learning

We develop a consensus Monte Carlo (CMC) algorithm for Bayesian nonparametric (BNP) inference with datasets that are too large for full posterior simulation on a single machine due to CPU or memory limitations. The proposed algorithm is for inference under BNP models for random subsets, including clustering, feature allocation (FA), and related models. We distribute a large dataset across multiple machines, run separate instances of Markov chain Monte Carlo (MCMC) simulation in parallel, and then aggregate the Monte Carlo samples across machines. The proposed CMC hinges on choosing a subset of observations as anchor points (Kunkel and Peruggia, 2018), which are distributed to every machine in addition to the observations that are available only on that machine. These shared anchor points then serve to merge Monte Carlo draws of clusters or features across machines.
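The following is a minimal, hypothetical sketch of how shared anchor points can be used to merge cluster labels across machines: local clusters from different machines that contain a common anchor observation are identified with one another via a union-find structure. It illustrates the anchoring idea on a single partition per machine; the actual CMC algorithm operates on full MCMC samples under BNP models for random subsets, and all names below are illustrative.

# Sketch: merge per-machine cluster labels by identifying clusters that
# share an anchor observation.
from collections import defaultdict

def merge_by_anchors(machine_partitions, anchor_ids):
    # machine_partitions: list of dicts {observation_id: local_cluster_label}.
    # anchor_ids: set of observation ids shared by every machine.
    # Returns a global cluster label for every observation.
    parent = {}

    def find(x):  # union-find with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link local clusters (across machines) that contain the same anchor.
    anchor_to_clusters = defaultdict(list)
    for m, part in enumerate(machine_partitions):
        for obs, lab in part.items():
            if obs in anchor_ids:
                anchor_to_clusters[obs].append((m, lab))
    for clusters in anchor_to_clusters.values():
        for c in clusters[1:]:
            union(clusters[0], c)

    # Assign each observation the merged (global) cluster of its local cluster.
    return {obs: find((m, lab))
            for m, part in enumerate(machine_partitions)
            for obs, lab in part.items()}

# Example: two machines sharing anchors {0, 1}; machine 0 also holds {10, 11},
# machine 1 also holds {20, 21}. Clusters 'a'/'x' and 'b'/'y' get merged.
p0 = {0: 'a', 1: 'b', 10: 'a', 11: 'b'}
p1 = {0: 'x', 1: 'y', 20: 'x', 21: 'y'}
print(merge_by_anchors([p0, p1], anchor_ids={0, 1}))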