Bayesian Neural Networks
Beyond NNGP: Large Deviations and Feature Learning in Bayesian Neural Networks
Papagiannouli, Katerina, Trevisan, Dario, Zitto, Giuseppe Pio
We study wide Bayesian neural networks, focusing on the rare but statistically dominant fluctuations that govern posterior concentration beyond Gaussian-process limits. Large-deviation theory provides explicit variational objectives (rate functions) on predictors, yielding an emerging notion of complexity and feature learning directly at the functional level. We show that the posterior output rate function is obtained by a joint optimization over predictors and internal kernels, in contrast with fixed-kernel (NNGP) theory. Numerical experiments demonstrate that the resulting predictions accurately describe finite-width behavior for moderately sized networks, capturing non-Gaussian tails, posterior deformation, and data-dependent kernel-selection effects.
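The joint optimization described in the abstract admits a schematic form. The display below is an illustrative sketch only; the symbols I, J, and K are introduced here for exposition and are not taken from the paper:

\[
I(f) \;=\; \inf_{K}\left[\tfrac{1}{2}\, f^{\top} K^{-1} f \;+\; J(K)\right].
\]

Here the quadratic term is the cost of realizing the predictor \(f\) under a Gaussian process with kernel \(K\) (the fixed-kernel NNGP cost), while \(J(K)\) penalizes deviations of the internal kernel from its infinite-width value; optimizing jointly over \(f\) and \(K\) is what produces data-dependent kernel selection.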
Dirichlet Scale Mixture Priors for Bayesian Neural Networks
Arnstad, August, Rønneberg, Leiv, Storvik, Geir
Neural networks are the cornerstone of modern machine learning, yet they can be difficult to interpret, give overconfident predictions, and are vulnerable to adversarial attacks. Bayesian neural networks (BNNs) alleviate some of these limitations, but have problems of their own. The key step of specifying prior distributions in BNNs is no trivial task, yet it is often skipped for convenience. In this work, we propose a new class of prior distributions for BNNs, the Dirichlet scale mixture (DSM) prior, which addresses current limitations of BNNs through structured, sparsity-inducing shrinkage. Theoretically, we derive general dependence structures and shrinkage results for DSM priors and show how they manifest under the geometry induced by neural networks. In experiments on simulated and real-world data we find that DSM priors encourage sparse networks through implicit feature selection, are robust under adversarial attacks, and deliver competitive predictive performance with substantially fewer effective parameters. Their advantages are most pronounced in correlated, moderately small data regimes, and the resulting networks are more amenable to weight pruning. Moreover, by adopting heavy-tailed shrinkage mechanisms, our approach aligns with recent findings that such priors can mitigate the cold posterior effect, offering a principled alternative to the commonly used Gaussian priors.
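The abstract does not spell out the construction, but a generic Dirichlet scale mixture can be sketched as follows: a heavy-tailed global scale is allocated across weights by a Dirichlet vector, so most weights are shrunk strongly while a few escape shrinkage. The sample_dsm_weights helper and its hyperparameters below are hypothetical, for illustration only, and need not match the paper's exact prior.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dsm_weights(n_weights, alpha=0.5):
    """Sample weights from a generic Dirichlet scale mixture prior.

    Illustrative construction (an assumption, not the paper's exact one):
    a heavy-tailed global scale tau (here half-Cauchy) is partitioned
    across weights by a Dirichlet vector phi, and each weight is
    Gaussian with variance phi_j * tau. Small alpha concentrates the
    Dirichlet mass on few components, which induces sparsity.
    """
    phi = rng.dirichlet(alpha * np.ones(n_weights))  # shrinkage allocation
    tau = np.abs(rng.standard_cauchy())              # heavy-tailed global scale
    return rng.normal(0.0, np.sqrt(phi * tau), size=n_weights)

w = sample_dsm_weights(1000, alpha=0.1)
print(f"fraction of weights with |w| < 1e-3: {(np.abs(w) < 1e-3).mean():.2f}")
```

With alpha well below 1, the Dirichlet draw is near-degenerate, so most sampled weights are negligible while a handful carry almost all of the scale, which is the implicit feature selection the abstract describes.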
However, the structure of their hidden-layer representations is only theoretically well understood in certain infinite-width limits, in which these representations cannot flexibly adapt to learn data-dependent features [3-11, 24]. In the Bayesian setting, these representations are described by fixed, deterministic kernels [3-11].
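For concreteness, such a fixed infinite-width kernel can be computed explicitly for fully connected ReLU networks via the standard arccosine recursion from the NNGP literature. The short sketch below (function name and constants are ours) makes the point that the kernel is a deterministic function of the inputs, with no learned features:

```python
import numpy as np

def nngp_kernel_relu(x1, x2, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel for a depth-layer fully connected ReLU network.

    Standard recursion: each layer updates K(x, x') via the ReLU
    arccosine formula. The result is a fixed, deterministic kernel;
    no feature learning occurs in this infinite-width limit.
    """
    # Layer-0 kernel: scaled inner products of the inputs.
    k11 = sigma_w2 * np.dot(x1, x1) / len(x1) + sigma_b2
    k22 = sigma_w2 * np.dot(x2, x2) / len(x2) + sigma_b2
    k12 = sigma_w2 * np.dot(x1, x2) / len(x1) + sigma_b2
    for _ in range(depth):
        # Angle between the two inputs under the current kernel.
        theta = np.arccos(np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0))
        # ReLU arccosine update for the cross term ...
        k12 = (sigma_w2 / (2 * np.pi)) * np.sqrt(k11 * k22) * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta)
        ) + sigma_b2
        # ... and for the diagonal terms (E[relu(u)^2] = k/2).
        k11 = (sigma_w2 / 2) * k11 + sigma_b2
        k22 = (sigma_w2 / 2) * k22 + sigma_b2
    return k12

x, y = np.ones(4), np.array([1.0, -1.0, 1.0, -1.0])
print(nngp_kernel_relu(x, y))
```

Note that the kernel depends only on inner products of the raw inputs and on the architecture hyperparameters, never on training targets, which is exactly the rigidity the passage above attributes to these limits.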