Bayesian Learning
CENSOR: Defense Against Gradient Inversion via Orthogonal Subspace Bayesian Sampling
Zhang, Kaiyuan, Cheng, Siyuan, Shen, Guangyu, Ribeiro, Bruno, An, Shengwei, Chen, Pin-Yu, Zhang, Xiangyu, Li, Ninghui
Federated learning collaboratively trains a neural network on a global server, where each local client receives the current global model weights and sends back parameter updates (gradients) based on its local private data. Sending these model updates may leak the client's private data. Existing gradient inversion attacks can exploit this vulnerability to recover private training instances from a client's gradient vectors. Recently, researchers have proposed advanced gradient inversion techniques that existing defenses struggle to handle effectively. In this work, we present a novel defense tailored for large neural network models. Our defense capitalizes on the high dimensionality of the model parameters to perturb gradients within a subspace orthogonal to the original gradient. By leveraging cold posteriors over orthogonal subspaces, our defense implements a refined gradient update mechanism. This enables the selection of an optimal gradient that not only safeguards against gradient inversion attacks but also maintains model utility. We conduct comprehensive experiments across three different datasets and evaluate our defense against various state-of-the-art attacks and defenses. Code is available at https://censor-gradient.github.io.
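The idea of perturbing a gradient only within the subspace orthogonal to it can be illustrated with a minimal sketch. This is not the CENSOR algorithm itself (the Bayesian sampling and gradient selection steps are omitted); it only shows the projection step, and the function name `orthogonal_perturbation`, the `scale` parameter, and the toy gradient are assumptions made for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonal_perturbation(grad, scale=1.0, seed=0):
    """Sample Gaussian noise and remove its component along `grad`.

    Hypothetical helper: the result is orthogonal to `grad` up to
    floating-point error, so adding it leaves the projection of the
    update onto the original gradient unchanged.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in grad]
    gg = dot(grad, grad)
    if gg == 0.0:
        # A zero gradient has no direction to protect; return scaled noise.
        return [scale * n for n in noise]
    coeff = dot(noise, grad) / gg  # projection coefficient onto grad
    return [scale * (n - coeff * g) for n, g in zip(noise, grad)]


grad = [0.5, -1.0, 2.0, 0.25]          # toy gradient (assumption)
delta = orthogonal_perturbation(grad, scale=0.1, seed=42)
perturbed = [g + d for g, d in zip(grad, delta)]
```

Because `delta` is orthogonal to `grad`, the inner product of `perturbed` with `grad` equals that of `grad` with itself, which is one intuition for why such perturbations need not change the first-order descent along the original gradient.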
Approximate Message Passing for Bayesian Neural Networks
Sommerfeld, Romeo, Helms, Christian, Herbrich, Ralf
Bayesian neural networks (BNNs) offer the potential for reliable uncertainty quantification and interpretability, which are critical for trustworthy AI in high-stakes domains. In this work, we advance message passing (MP) for BNNs and present a novel framework that models the predictive posterior as a factor graph. To the best of our knowledge, our framework is the first MP method that handles convolutional neural networks and avoids double-counting training data, a limitation of previous MP methods that causes overconfidence. We evaluate our approach on CIFAR-10 with a convolutional neural network of roughly 890k parameters and find that it can compete with the SOTA baselines AdamW and IVON, even having an edge in terms of calibration. On synthetic data, we validate the uncertainty estimates and observe a strong correlation (0.9) between posterior credible intervals and their probability of covering the true data-generating function outside the training range. While our method scales to an MLP with 5.6 million parameters, further improvements are necessary to match the scale and performance of state-of-the-art variational inference methods.
Deep learning models have achieved impressive results across various domains, including natural language processing (Vaswani et al., 2023), computer vision (Ravi et al., 2024), and autonomous systems (Bojarski et al., 2016). Yet, they often produce overconfident but incorrect predictions, particularly in ambiguous or out-of-distribution scenarios. Without the ability to effectively quantify uncertainty, this can foster both overreliance and underreliance on models, as users stop trusting their outputs entirely (Zhang et al., 2024), and in high-stakes domains like healthcare or autonomous driving, its application can be dangerous (Henne et al., 2020).
To ensure safer deployment in these settings, models must not only predict outcomes but also express how uncertain they are about those predictions to allow for informed decision-making. Bayesian neural networks (BNNs) offer a principled way to quantify uncertainty by capturing a posterior distribution over the model's weights, rather than relying on point estimates as in traditional neural networks. This allows BNNs to express epistemic uncertainty, the model's lack of knowledge about the underlying data distribution.
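How a posterior over weights yields epistemic uncertainty can be seen in a minimal sketch. It uses a one-weight Bayesian linear model with a conjugate Gaussian prior rather than a neural network or the message-passing method described above; the prior precision `alpha`, noise precision `beta`, and the toy data are all assumptions chosen for illustration.

```python
def posterior(xs, ys, alpha=1.0, beta=25.0):
    """Gaussian posterior over a single weight w in y = w * x + noise.

    Prior: w ~ N(0, 1/alpha). Noise: N(0, 1/beta). Both are assumptions.
    Returns the posterior mean and variance of w.
    """
    precision = alpha + beta * sum(x * x for x in xs)
    mean = beta * sum(x * y for x, y in zip(xs, ys)) / precision
    return mean, 1.0 / precision


def predictive(x_star, w_mean, w_var, beta=25.0):
    """Predictive mean plus the epistemic and aleatoric variance terms."""
    epistemic = x_star ** 2 * w_var   # from uncertainty about w
    aleatoric = 1.0 / beta            # from observation noise
    return w_mean * x_star, epistemic, aleatoric


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]      # training inputs in [-1, 1]
ys = [2.0 * x for x in xs]            # noiseless targets, true slope 2
w_mean, w_var = posterior(xs, ys)

_, epi_inside, _ = predictive(0.5, w_mean, w_var)   # inside training range
_, epi_outside, _ = predictive(5.0, w_mean, w_var)  # far outside it
```

In this toy model the epistemic term grows with the square of the distance from the origin, so `epi_outside` exceeds `epi_inside`, whereas a point estimate of the weight would report no such difference.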
Review for NeurIPS paper: Gibbs Sampling with People
Weaknesses: Overall, I thought this was a strong paper. The main concerns I had were as follows: (1) Mode-seeking versus showing the distribution: The aggregated results in the first experiment seem to show much more homogeneity than the results for GSP or MCMCP. It seems like one limitation of this approach might be that there is limited exploration of the space, perhaps making it hard to move between modes, and also making it more difficult to see the full shape of the distribution, which I have often taken to be a goal in work using MCMCP. The movement between optimization and seeking a distribution is discussed to some extent in the paper, but I would be interested in seeing this discussed more (and perhaps whether GSP without aggregation is likely to lead to more optimization than MCMCP). In the author response, they have shown additional information suggesting that GSP is more mode-seeking but also does a better job of capturing the distribution.
Review for NeurIPS paper: Gibbs Sampling with People
This paper introduces a new method for eliciting human representations of perceptual concepts, such as what RGB values people think correspond to the color "sunset" or what auditory dimensions (e.g. Rather than eliciting representations via guess-and-check (i.e., start with a dataset and then apply human-generated labels), this method (Gibbs Sampling with People, or GSP) enables inference to go in the other direction (i.e., start with labels, and then identify percepts that match those labels). GSP extends prior work (MCMC with People) to allow eliciting representations of much higher-dimensional stimuli. The reviewers unanimously praised this paper for tackling an important and relevant problem in cognitive science, for its breadth of empirical results, and for its novelty over prior work. R2 stated that the paper is "impressive in scale, scope, and results", R3 stated that it was "very relevant to the NeurIPS community and very novel", and R4 felt there could be "a potentially large impact of this work" with "substantial interest" amongst the NeurIPS community.
Reviews: BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning
My score remains the same. The method proposed in the paper elegantly deals with the problem of redundant acquisition when using BALD in a greedy manner. I have a few questions and hope the authors can address them: (1) Does this problem of redundant acquisition only happen when one uses BALD as the score? Intuitively I would think not: if one uses any score function greedily, regardless of the contribution of the other samples selected in the same batch, one can still end up with a biased batch that can potentially harm training. If this is the case, then why are var-ratios and mean-std outperforming random?
Reviews: BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning
The paper proposes BatchBALD, a batch acquisition function for sample selection in active learning. A greedy optimization algorithm is presented for efficient sample selection and BatchBALD score maximization. The reviewers and AC agree that this is an interesting work and that the approach is clearly presented and convincing. In addition, the author response satisfactorily addresses the points raised in the reviews.
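For context, the per-sample BALD score that the reviews refer to is the mutual information between a prediction and the model parameters: the entropy of the mean prediction minus the mean entropy of the individual predictions. The sketch below estimates it from a handful of posterior samples. It covers only single-point BALD, not the joint batch score that BatchBALD introduces, and the function names and toy probabilities are assumptions.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def bald_score(sample_probs):
    """Monte Carlo estimate of BALD for one input.

    `sample_probs` holds one class-probability vector per posterior
    sample (for example, per dropout mask). Hypothetical helper.
    """
    k = len(sample_probs)
    n_classes = len(sample_probs[0])
    mean_p = [sum(s[c] for s in sample_probs) / k for c in range(n_classes)]
    return entropy(mean_p) - sum(entropy(s) for s in sample_probs) / k


# Posterior samples disagree confidently: high epistemic uncertainty.
disagree = [[0.9, 0.1], [0.1, 0.9]]
# Posterior samples agree on an uncertain answer: pure aleatoric noise.
agree = [[0.5, 0.5], [0.5, 0.5]]

score_disagree = bald_score(disagree)
score_agree = bald_score(agree)
```

Scoring each candidate independently like this and taking the top k is what allows near-duplicate points to be selected together, since each copy receives the same high score; that is the redundancy the first review asks about.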
Reviews: Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks
One contribution is a new approach for training neural networks with binary activations. The second contribution is PAC-Bayesian generalization bounds for binary activated neural networks that, when used as the training objective, come very close to test accuracy (i.e. The gap between the training and test performance is also much smaller. I think this is very promising for training more robust networks. The method actually recovers variational Bayesian learning when the coefficient C is fixed, but in contrast to it, this coefficient is learned in a principled way.
A New Approach for Knowledge Generation Using Active Inference
Ghasimi, Jamshid, Movarraei, Nazanin
Various models have been proposed for how knowledge is generated in the human brain, including the semantic network model. Although this model has been widely studied and computational versions of it have been presented, its application is limited to semantic knowledge: it was formed around semantic memory and declarative knowledge, and it has many limitations in explaining procedural and conditional knowledge. Given the importance of an appropriate model of knowledge generation, especially for improving human cognitive functions or building intelligent machines, improving existing models or providing more comprehensive ones is of great importance. In the current study, based on the free energy principle of the brain, the researchers propose a model for generating three types of knowledge: declarative, procedural, and conditional. While explaining these different types of knowledge, the model is capable of computing and generating concepts from stimuli based on probabilistic mathematics and the action-perception process (active inference). The proposed model learns in an unsupervised manner: it can update itself using a combination of different stimuli and, as a generative model, can generate new concepts from unlabeled stimuli. In this model, the active inference process is used to generate procedural and conditional knowledge, and the perception process is used to generate declarative knowledge.
Reviews: A Polynomial Time Algorithm for Log-Concave Maximum Likelihood via Locally Exponential Families
Post-rebuttal: The authors have promised to incorporate an exposition of the sampler in the revised paper; I believe that will make the paper a more self-contained read. I maintain my rating of strong accept (8). I think this paper makes very nice contributions to the fundamental question of estimating the MLE distribution given a bunch of observations. I think the key contributions can be broken up into two key parts: - A bunch of simple but elegant structural results for the MLE distribution in terms of 'tent distributions' -- distributions whose log-density is piecewise linear and which are supported over subdivisions of the convex hull of the datapoints. This allows them to write a convex program for optimizing over tent distributions.
Reviews: A Polynomial Time Algorithm for Log-Concave Maximum Likelihood via Locally Exponential Families
The submission provides a polynomial-time approximation algorithm for finding the maximum-likelihood log-concave density for a given set of data points in R^d, for arbitrary d. The work is theoretical in nature, with proofs and no experiments. The problem is very interesting, since log-concave distributions include many of the commonly used parametric families (such as Gaussian), and the log-concave MLE also has other interesting properties. Previously the sample complexity of learning a log-concave distribution has been studied, but a polynomial-time algorithm has been lacking. The present work provides such an algorithm.
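For readers unfamiliar with the setup, the estimation problem and the tent functions mentioned in the reviews can be written out as follows. This is the standard formulation from the log-concave MLE literature rather than notation taken from the reviewed paper; the symbols X_i, n, y, F_d, and h_y are assumptions introduced here.

```latex
% Log-concave MLE over data points X_1, ..., X_n in R^d
\hat{f}_n \in \arg\max_{f \in \mathcal{F}_d} \; \frac{1}{n} \sum_{i=1}^{n} \log f(X_i),
\qquad
\mathcal{F}_d = \Bigl\{ f : \mathbb{R}^d \to [0, \infty) \;:\; \log f \text{ is concave},\ \int_{\mathbb{R}^d} f = 1 \Bigr\}.

% Tent function with pole heights y = (y_1, ..., y_n): the least concave
% function satisfying h_y(X_i) >= y_i for all i
h_y(x) =
\begin{cases}
\sup \Bigl\{ \sum_{i=1}^{n} \lambda_i y_i \;:\; \lambda_i \ge 0,\ \sum_{i=1}^{n} \lambda_i = 1,\ \sum_{i=1}^{n} \lambda_i X_i = x \Bigr\},
  & x \in \operatorname{conv}\{X_1, \dots, X_n\}, \\[4pt]
-\infty, & \text{otherwise.}
\end{cases}
```

A tent distribution in the sense of the first review then has density proportional to exp(h_y(x)), so the search over all log-concave densities reduces to a search over the n pole heights y.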