A Bayesian Inference over Neural Networks
The prior and likelihood are both modelling choices. Since (14) is intractable, we typically sample a finite set of parameters and compute a Monte Carlo estimator.

A.1 Likelihoods for BNNs

The likelihood is purely a function of the model prediction Φ. As such, BNN likelihood distributions follow the standard choices used in other probabilistic models. Neal [21] shows that, in the regression setting, the isotropic Gaussian prior for a BNN with a single hidden layer approaches a Gaussian process prior as the number of hidden units tends to infinity, so long as the chosen activation function is bounded. We use this prior in the baseline BNN for our experiments.
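The Monte Carlo estimator mentioned above can be sketched as follows for a single-hidden-layer regression network with a bounded (tanh) activation under an isotropic Gaussian prior. The prior scales are illustrative assumptions, and for simplicity the parameters are sampled from the prior; in practice the samples would come from an approximate posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_predict(x, w1, b1, w2, b2):
    # Single-hidden-layer network with a bounded activation (tanh),
    # matching the setting of Neal's GP-limit result.
    return np.tanh(x @ w1 + b1) @ w2 + b2

def mc_predictive_mean(x, n_samples=1000, hidden=50):
    # Monte Carlo estimate of E_p(w)[Phi(x; w)]: draw parameter samples,
    # evaluate the network, and average the predictions. The 1/sqrt(hidden)
    # output-weight scale is the one used in the GP-limit argument.
    preds = []
    for _ in range(n_samples):
        w1 = rng.normal(0.0, 1.0, size=(x.shape[1], hidden))
        b1 = rng.normal(0.0, 1.0, size=hidden)
        w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, 1))
        b2 = rng.normal(0.0, 1.0, size=1)
        preds.append(mlp_predict(x, w1, b1, w2, b2))
    return np.mean(preds, axis=0)

x = np.array([[0.5, -1.0]])
mean = mc_predictive_mean(x)
```

By the symmetry of the zero-mean prior, the estimated predictive mean concentrates around zero as the number of samples grows.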
Cross-modal Representation Flattening for Multi-modal Domain Generalization
Yunfeng Fan
Multi-modal domain generalization (MMDG) requires that models trained on multi-modal source domains generalize to unseen target distributions with the same modality set. Sharpness-aware minimization (SAM) is an effective technique for traditional uni-modal domain generalization (DG); however, it yields only limited improvement in MMDG. In this paper, we identify modality competition and discrepant uni-modal flatness as two main factors that restrict multi-modal generalization. To overcome these challenges, we propose to construct consistent flat loss regions and enhance knowledge exploitation for each modality via cross-modal knowledge transfer. First, we turn to optimization on representation-space loss landscapes instead of the traditional parameter space, which allows us to build connections between modalities directly. Then, we introduce a novel method to flatten the high-loss region between minima from different modalities by interpolating mixed multi-modal representations. We implement this method by distilling and optimizing generalizable interpolated representations and assigning distinct weights to each modality according to their divergent generalization capabilities. Extensive experiments are performed on two benchmark datasets, EPIC-Kitchens and Human-Animal-Cartoon (HAC), with various modality combinations, demonstrating the effectiveness of our method under both multi-source and single-source settings.
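The interpolation idea can be illustrated with a minimal sketch, assuming two modalities (named "video" and "audio" here for concreteness) with fixed per-modality weights; the paper's actual distillation objective and weighting scheme may differ.

```python
import numpy as np

def mixed_representation(z_video, z_audio, lam):
    # Linear interpolation between two modalities' representations in
    # representation space; lam in [0, 1] controls the mixing ratio.
    return lam * z_video + (1.0 - lam) * z_audio

def flattening_loss(z_video, z_audio, w_video=0.7, w_audio=0.3, n_points=5):
    # Penalize disagreement between each modality's representation and
    # points along the interpolation path, with per-modality weights
    # (assumed values) reflecting divergent generalization capabilities.
    loss = 0.0
    for lam in np.linspace(0.0, 1.0, n_points):
        z_mix = mixed_representation(z_video, z_audio, lam)
        loss += w_video * np.mean((z_video - z_mix) ** 2)
        loss += w_audio * np.mean((z_audio - z_mix) ** 2)
    return loss / n_points

z_equal = flattening_loss(np.ones(8), np.ones(8))
z_apart = flattening_loss(np.ones(8), -np.ones(8))
```

The loss vanishes when the two modalities' representations already agree and grows with their separation, so minimizing it pulls the representations toward a shared low-loss region.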
Supplementary Materials for "Private Set Generation with Discriminative Information"
These supplementary materials include the privacy analysis (A), the details of the adopted algorithms (B), the details of the experiment setup (C), and additional results and discussions (D). Our privacy computation is based on the notion of Rényi-DP, which we recall as follows. Lastly, we use the following theorem to convert (α, ε)-RDP to (ε, δ)-DP. We present the pseudocode of the generator prior experiments (Section 6 of the main paper) in Algorithm 2, which is supplementary to Figures 4 and 5 and Equation 8 of the main paper. While it is possible to allow random sampling of the latent code and generate a changeable S to mimic the training of generative models (i.e., train a generative network using the gradient matching loss), we observe that such training easily fails in the early stage.
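The standard RDP-to-DP conversion (Mironov, 2017), which is presumably the theorem referred to above, can be written as a one-line helper:

```python
import math

def rdp_to_dp(alpha, eps_rdp, delta):
    # Standard conversion: if a mechanism satisfies (alpha, eps_rdp)-RDP,
    # then it satisfies (eps_rdp + log(1/delta)/(alpha - 1), delta)-DP.
    return eps_rdp + math.log(1.0 / delta) / (alpha - 1.0)

eps_dp = rdp_to_dp(alpha=2.0, eps_rdp=1.0, delta=1e-5)
```

In practice one minimizes the converted ε over the RDP orders α tracked by the accountant.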
Private Set Generation with Discriminative Information
Differentially private data generation techniques have become a promising solution to the data privacy challenge: they enable sharing of data while complying with rigorous privacy guarantees, which is essential for scientific progress in sensitive domains. Unfortunately, restricted by the inherent complexity of modeling high-dimensional distributions, existing private generative models struggle with the utility of synthetic samples. In contrast to existing works that aim at fitting the complete data distribution, we directly optimize for a small set of samples that are representative of the distribution under the supervision of discriminative information from downstream tasks, which is generally an easier task and more suitable for private training. Our work provides an alternative view for differentially private generation of high-dimensional data and introduces a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
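The discriminative supervision described above is commonly instantiated as gradient matching between real and synthetic batches, as in dataset distillation; the cosine-distance form below is one common choice and is an assumption here, not necessarily the paper's exact loss.

```python
import numpy as np

def gradient_match_loss(grad_real, grad_syn):
    # Cosine-distance gradient matching: push the classifier gradient
    # computed on the synthetic set toward the (privatized, in the DP
    # setting) gradient computed on real data.
    num = float(np.sum(grad_real * grad_syn))
    den = float(np.linalg.norm(grad_real) * np.linalg.norm(grad_syn)) + 1e-8
    return 1.0 - num / den

g = np.array([1.0, 2.0, -0.5])
aligned = gradient_match_loss(g, 2.0 * g)          # same direction
orthogonal = gradient_match_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Minimizing this loss with respect to the synthetic samples makes them reproduce the training signal of the real data without fitting the full distribution.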
Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy Curvature of Attention
Conditional diffusion models have shown remarkable success in visual content generation, producing high-quality samples across various domains, largely due to classifier-free guidance (CFG). Recent attempts to extend guidance to unconditional models have relied on heuristic techniques, resulting in suboptimal generation quality and unintended effects. In this work, we propose Smoothed Energy Guidance (SEG), a novel training- and condition-free approach that leverages the energy-based perspective of the self-attention mechanism to enhance image generation. By defining the energy of self-attention, we introduce a method to reduce the curvature of the energy landscape of attention and use the output as the unconditional prediction. Practically, we control the curvature of the energy landscape by adjusting the Gaussian kernel parameter while keeping the guidance scale parameter fixed. Additionally, we present a query blurring method that is equivalent to blurring the entire attention weights without incurring quadratic complexity in the number of tokens. In our experiments, SEG achieves a Pareto improvement in both quality and the reduction of side effects.
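The query blurring idea can be sketched at the logit level: because the attention logits QKᵀ are linear in Q, convolving the queries with a Gaussian kernel smooths the logit map at O(n·k) cost instead of operating on the O(n²) weight matrix. The 1-D convolution and kernel-size rule below are illustrative simplifications of the paper's method.

```python
import numpy as np

def gaussian_kernel_1d(size, sigma):
    # Normalized 1-D Gaussian kernel of odd length `size`.
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def blur_queries(q, sigma):
    # Blur query tokens along the sequence axis, one channel at a time.
    # Since logits QK^T are linear in Q, this smooths each row of the
    # logit map without ever materializing the full attention weights.
    size = max(int(6 * sigma) | 1, 1)  # odd kernel covering ~±3 sigma
    k = gaussian_kernel_1d(size, sigma)
    return np.stack(
        [np.convolve(q[:, d], k, mode="same") for d in range(q.shape[1])],
        axis=1,
    )

q = np.random.default_rng(0).normal(size=(8, 4))
out = blur_queries(q, sigma=1.0)
```

As sigma shrinks, the kernel collapses to the identity and the original queries (and hence the original attention) are recovered.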
Revisiting the Sample Complexity of Sparse Spectrum Approximation of Gaussian Processes
We introduce a new scalable approximation for Gaussian processes with provable guarantees which hold simultaneously over its entire parameter space. Our approximation is obtained from an improved sample complexity analysis for sparse spectrum Gaussian processes (SSGPs). In particular, our analysis shows that under a certain data disentangling condition, an SSGP's prediction and model evidence (for training) can well-approximate those of a full GP with low sample complexity. We also develop a new auto-encoding algorithm that finds a latent space to disentangle latent input coordinates into well-separated clusters, which is amenable to our sample complexity analysis. We validate our proposed method on several benchmarks with promising results supporting our theoretical analysis.
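A sparse spectrum GP approximates a stationary kernel by sampling spectral frequencies from the kernel's spectral density; for the RBF kernel these are Gaussian. The following minimal sketch (illustrative scales, not the paper's algorithm) shows the feature map whose inner products approximate the kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

def ssgp_features(x, n_features=2000, lengthscale=1.0):
    # Sparse-spectrum (random Fourier) features for the RBF kernel:
    # frequencies omega are drawn from the kernel's spectral density
    # N(0, 1/lengthscale^2), so phi(x) . phi(x') approximates k(x, x').
    d = x.shape[1]
    omega = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_features))
    proj = x @ omega
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_features)

x = np.array([[0.0, 0.0], [1.0, 0.0]])
phi = ssgp_features(x)
approx_k = float(phi[0] @ phi[1])       # approximates exp(-0.5 * |x - x'|^2)
```

GP prediction and model evidence then reduce to Bayesian linear regression in this finite feature space, which is what makes the approximation scalable.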
The Impact of Geometric Complexity on Neural Collapse in Transfer Learning
Many of the recent remarkable advances in computer vision and language models can be attributed to the success of transfer learning via the pre-training of large foundation models. However, a theoretical framework explaining this empirical success remains incomplete and an active area of research. Flatness of the loss surface and neural collapse have recently emerged as useful pre-training metrics which shed light on the implicit biases underlying pre-training. In this paper, we explore the geometric complexity of a model's learned representations as a fundamental mechanism that relates these two concepts. We show through experiments and theory that mechanisms which affect the geometric complexity of the pre-trained network also influence its neural collapse. Furthermore, we show that this effect of the geometric complexity generalizes to the neural collapse of new classes as well, thus encouraging better performance on downstream tasks, particularly in the few-shot setting.
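One common formalization of geometric complexity is the dataset-averaged squared Frobenius norm of the model's input-output Jacobian; whether this matches the paper's exact definition is an assumption. A minimal finite-difference estimate:

```python
import numpy as np

def geometric_complexity(f, xs, eps=1e-5):
    # Dataset-averaged squared Frobenius norm of the Jacobian of f,
    # estimated by central finite differences over each input coordinate.
    total = 0.0
    for x in xs:
        jac_sq = 0.0
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = eps
            col = (f(x + e) - f(x - e)) / (2.0 * eps)  # i-th Jacobian column
            jac_sq += float(np.sum(col ** 2))
        total += jac_sq
    return total / len(xs)

# For a linear map f(x) = A x the quantity equals ||A||_F^2 exactly.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
gc = geometric_complexity(lambda x: A @ x,
                          [np.array([0.5, -1.0]), np.array([1.0, 2.0])])
```

Lower values correspond to smoother learned representations, which is the quantity the paper links to both flatness and neural collapse.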
Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets: Supplementary Materials
Code is modified from https://github.com/coeusguo/ceit; only a fragment of the attention-module listing survives here (a class deriving from `Module` whose `__init__` takes `dim` and `num_heads=8`). All the models are pre-trained on ImageNet-1K [1] only and then fine-tuned on the CIFAR-100 [2] dataset. Results are shown in Table 1. We cite the reported results from the corresponding papers. When fine-tuning our DHVT, we use the AdamW optimizer with a cosine learning rate scheduler and 2 warm-up epochs, a batch size of 256, an initial learning rate of 0.0005, a weight decay of 1e-8, and 100 fine-tuning epochs.
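The cosine schedule with warm-up described above can be sketched as a per-epoch learning-rate rule; the linear warm-up shape and zero final rate are common defaults and an assumption here.

```python
import math

def lr_at_epoch(epoch, total_epochs=100, warmup=2, base_lr=5e-4):
    # Cosine learning-rate decay with linear warm-up, using the fine-tuning
    # hyperparameters listed above (100 epochs, 2 warm-up epochs, lr 0.0005).
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup      # linear warm-up
    t = (epoch - warmup) / (total_epochs - warmup)  # progress in [0, 1]
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))

schedule = [lr_at_epoch(e) for e in range(100)]
```

The rate reaches the base value at the end of warm-up and decays smoothly toward zero by the final epoch.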
Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is attributed to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, namely spatial relevance and diverse channel representations. First, on the spatial aspect, objects are locally compact and relevant, so fine-grained features need to be extracted from a token and its neighbors; however, the lack of data hinders ViTs from attending to this spatial relevance. Second, on the channel aspect, representations exhibit diversity across different channels.