context vector
251c5ffd6b62cc21c446c963c76cf214-Supplemental.pdf
A.1 Network Architecture. Here, we describe in more detail the architecture of the eVAE presented in Figure 1 of the main paper. Event Context Network: We adapt the architecture proposed in [21] for the event context network (ECN), but without the feature transformation preprocessing steps. In our implementation, we use three Conv1d layers of 64, 128, and 1024 channels, each followed by BatchNorm and a ReLU activation. At the end of the ECN, we add the temporal features (see Appendix A.2) to the N × 1024 feature tensor and apply the max operation to obtain a context vector. The sizes of the intermediate features and the context feature are hyperparameters that can be varied based on the application, data complexity, etc. Encoder: The encoder for the VAE is composed of two layers, of sizes 1024 and 256 respectively, resulting in two output vectors of size 1 × 8 each, corresponding to the mean and standard deviation of the latent space vector.
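The ECN description above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each Conv1d layer has kernel size 1 (so it acts as a linear layer shared across events), that the max is taken over the N events, and that the temporal features form an N × 1024 array added element-wise. BatchNorm is omitted for brevity. The input dimension `d_in = 3`, the random weights, and the function name `event_context_vector` are hypothetical.

```python
import numpy as np

def event_context_vector(events, layers, temporal_features):
    """Map N events to one context vector.

    events:            (N, d_in) array, one row per event
    layers:            list of (W, b) pairs for the shared per-event layers
    temporal_features: (N, 1024) array (stand-in for Appendix A.2 features)
    """
    h = events
    for W, b in layers:
        # Kernel-size-1 Conv1d == the same linear map applied to every event,
        # followed by ReLU (BatchNorm left out of this sketch).
        h = np.maximum(h @ W + b, 0.0)
    h = h + temporal_features      # add temporal features to the N x 1024 tensor
    return h.max(axis=0)           # max over events -> context vector

rng = np.random.default_rng(0)
N, d_in = 500, 3                   # hypothetical event count and input size
sizes = [d_in, 64, 128, 1024]      # channel sizes from the text
layers = [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
events = rng.normal(size=(N, d_in))
temporal = rng.normal(size=(N, 1024))

context = event_context_vector(events, layers, temporal)
print(context.shape)               # (1024,)
```

Because the max is taken over the event axis, the resulting context vector does not depend on the order in which the events are presented, provided the temporal features are reordered with them.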
Exponential Family Embeddings
In this paper, we develop exponential family embeddings, a class of methods that extends the idea of word embeddings to other types of high-dimensional data. As examples, we studied neural data with real-valued observations, count data from a market basket analysis, and ratings data from a movie recommendation system. The main idea is to model each observation conditioned on a set of other observations.
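As a rough illustration of modeling each observation conditioned on a set of other observations, the sketch below computes the conditional mean of one count given its context. It assumes a Poisson conditional with a log link, an embedding vector `rho[i]` for the observation, and context vectors `alpha[j]` weighted by the context values. The array names, sizes, and the function `poisson_conditional_rate` are illustrative choices, not taken from the abstract.

```python
import numpy as np

def poisson_conditional_rate(i, context_idx, x, rho, alpha):
    """Poisson rate of observation i given the observations in its context.

    x:     (num_items,) observed counts
    rho:   (num_items, K) embedding vectors (assumed parameterization)
    alpha: (num_items, K) context vectors (assumed parameterization)
    """
    idx = np.asarray(context_idx)
    # Combine the context: sum of context vectors weighted by observed values.
    context = (alpha[idx] * x[idx][:, None]).sum(axis=0)
    eta = rho[i] @ context          # natural parameter of the conditional
    return np.exp(eta)              # log link: rate = exp(eta)

rng = np.random.default_rng(0)
num_items, K = 5, 3                 # hypothetical sizes
rho = rng.normal(scale=0.1, size=(num_items, K))
alpha = rng.normal(scale=0.1, size=(num_items, K))
x = np.array([2, 0, 1, 3, 0])       # e.g. item counts in one basket

rate = poisson_conditional_rate(0, [1, 2, 3], x, rho, alpha)
print(rate)
```

Swapping the Poisson for another exponential family would change only the last step, the mapping from the natural parameter to the conditional distribution.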
Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models
Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have shown that generated images sometimes fail to capture the intended semantic content of the text prompts, a phenomenon often called semantic misalignment. To address this, we present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. Then, we obtain the gradient of the log posterior of context vectors, which can be updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Using extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing.
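The compositional step mentioned above, a linear combination of cross-attention outputs from different contexts, can be sketched as follows. This illustrates only that combination using standard scaled dot-product attention. It does not implement the energy-based context update, and the projection matrices, shapes, coefficients, and function names are hypothetical.

```python
import numpy as np

def cross_attention(Q, K, V):
    """Standard scaled dot-product attention for one head."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def compose_contexts(Q, contexts, coeffs, W_K, W_V):
    """Linearly combine cross-attention outputs computed from several contexts."""
    outputs = [cross_attention(Q, C @ W_K, C @ W_V) for C in contexts]
    return sum(c * out for c, out in zip(coeffs, outputs))

rng = np.random.default_rng(0)
d_text, d = 6, 8                          # hypothetical embedding sizes
Q = rng.normal(size=(4, d))               # queries from 4 latent image tokens
W_K = rng.normal(size=(d_text, d))        # key projection (assumed)
W_V = rng.normal(size=(d_text, d))        # value projection (assumed)
contexts = [rng.normal(size=(5, d_text)), # text embeddings of prompt A
            rng.normal(size=(7, d_text))] # text embeddings of prompt B

composed = compose_contexts(Q, contexts, [0.7, 0.3], W_K, W_V)
print(composed.shape)                     # (4, 8)
```

The contexts may have different token counts, since each attention output has the same shape as the queries and can therefore be summed directly.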
Uncertainty Quantification for Deep Regression using Contextualised Normalizing Flows
Marco, Adriel Sosa, Kirwan, John Daniel, Toumpa, Alexia, Gerasimou, Simos
Quantifying uncertainty in deep regression models is important both for understanding the confidence of the model and for safe decision-making in high-risk domains. Existing approaches that yield prediction intervals overlook distributional information, neglecting the effect of multimodal or asymmetric distributions on decision-making. Similarly, full or approximated Bayesian methods, while yielding the predictive posterior density, demand major modifications to the model architecture and retraining. We introduce MCNF, a novel post hoc uncertainty quantification method that produces both prediction intervals and the full conditioned predictive distribution. MCNF operates on top of the underlying trained predictive model; thus, no predictive model retraining is needed. We provide experimental evidence that the MCNF-based uncertainty estimate is well calibrated, is competitive with state-of-the-art uncertainty quantification methods, and provides richer information for downstream decision-making tasks.