Goto

Collaborating Authors

 simple baseline


A Simple Baseline for Bayesian Uncertainty in Deep Learning

Neural Information Processing Systems

We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian using the SWA solution as the first moment and a low rank plus diagonal covariance also derived from the SGD iterates, forming an approximate posterior distribution over neural network weights; we then sample from this Gaussian distribution to perform Bayesian model averaging. We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including variational inference, MC dropout, KFAC Laplace, and temperature scaling.


Mining Unseen Classes via Regional Objectness: A Simple Baseline for Incremental Segmentation

Neural Information Processing Systems

Incremental or continual learning has been extensively studied for image classification tasks to alleviate catastrophic forgetting, a phenomenon in which earlier learned knowledge is forgotten when learning new concepts. For class incremental semantic segmentation, such a phenomenon often becomes much worse due to the semantic shift of the background class, \ie, some concepts learned at previous stages are assigned to the background class at the current training stage, therefore, significantly reducing the performance of these old concepts. To address this issue, we propose a simple yet effective method in this paper, named Mining unseen Classes via Regional Objectness (MicroSeg). Our MicroSeg is based on the assumption that \emph{background regions with strong objectness possibly belong to those concepts in the historical or future stages}. Therefore, to avoid forgetting old knowledge at the current training stage, our MicroSeg first splits the given image into hundreds of segment proposals with a proposal generator. Those segment proposals with strong objectness from the background are then clustered and assigned new defined labels during the optimization. In this way, the distribution characterizes of old concepts in the feature space could be better perceived, relieving the catastrophic forgetting caused by the semantic shift of the background class accordingly. We conduct extensive experiments on Pascal VOC and ADE20K, and competitive results well demonstrate the effectiveness of our MicroSeg.


A Simple Baseline for Bayesian Uncertainty in Deep Learning

Neural Information Processing Systems

We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian using the SWA solution as the first moment and a low rank plus diagonal covariance also derived from the SGD iterates, forming an approximate posterior distribution over neural network weights; we then sample from this Gaussian distribution to perform Bayesian model averaging. We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including variational inference, MC dropout, KFAC Laplace, and temperature scaling.


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

Consensus Monte Carlo (CMC) is a method for parallelizing MCMC for posterior inference over large datasets. It works by factorizing the posterior distribution into sub-posteriors each of which depend on only a subset of datapoints, sampling from each of these sub-posteriors in parallel, and then transforming samples from the sub-posteriors using an aggregation function to samples from the real posterior. Existing works use very naive methods of aggregation which result in high bias, or are computationally very expensive, which make it difficult to use Consensus Monte Carlo in practice. This paper proposes a more principled way of combining samples by optimizing over aggregation functions using variational inference. Clarity: The paper is well written and easy to follow.


Review for NeurIPS paper: Distributionally Robust Federated Averaging

Neural Information Processing Systems

Additional Feedback: I have ready the response and other reviews. For concenrn about experimental evaluation, the response is mostly "[27] did it too", which is the concern I flag at the end of review. Not meeting a simple baseline is ignored. Moreover, [27] also tries to do experiment in a more realistic setup, where the simple baseline would not hold, which this work does not reproduce response does not mention. So I see this concern as not addressed.


Reviews: Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations.

Neural Information Processing Systems

Two or three relevant citations: Transformer models should probably be mentioned in the section on "models designed specifically for use on sequences", since they are competing heavily with the referenced baselines on NLP tasks especially. I believe your numbers on the Yelp dataset compare very favorably to the "sentiment neuron" work from Radford et al https://arxiv.org/abs/1704.01444 - that could be a nice addition and add further external context to your results. Some questions about the architecture, particularly the importance of the "additive skip connection" from input to output - how crucial is this connection, since it somewhat allows the network to bypass the TFiLM layers entirely? Does using a stacked skip (with free trainable parameters) still work, or does it hurt network training / break it completely? What is the SNR of the cubic interpolation used as input for the audio experiments?


Reviews: A Simple Baseline for Bayesian Uncertainty in Deep Learning

Neural Information Processing Systems

The method is almost trivially simple, scalable and easy to implement, yet the empirical evaluation shows that it performs competitively and often better than all alternatives. This is the best kind of paper! The task of representing uncertainty over model weights is highly significant -- it is debatably *the* core problem in Bayesian deep learning, with (as the authors point out) applications to calibrated decision making, out-of-sample detection, adversarial robustness, transfer learning, and more. I expect this baseline to be widely used by researchers in the field, and likely implemented by practitioners as well. The paper is well written and easy to follow.


Reviews: A Simple Baseline for Bayesian Uncertainty in Deep Learning

Neural Information Processing Systems

This paper presents SWAG, a method that uses the iterates of a Polyak-averaging-like stochastic gradient descent to approximate the posterior distribution of a neural network. It is presented as a simple baseline for uncertainty in large deep neural networks and the authors demonstrate its effectiveness on a variety of large scale tasks including residual networks on CIFAR and Imagenet. The strengths of this paper are: - it is indeed a simple baseline for a promising area of research that is really lacking good baselines - experiments are thorough and on benchmarks that are large and interesting to the wider deep learning community - the authors empirically evaluate the quality of their approximation and provide some analysis The main criticism of this paper is that it is not really Bayesian from a purist perspective. R3 is correct to point out that the presented approximation can not actually capture the true posterior as shown by Mandt et al. (Stochastic Gradient Descent as Approximate Bayesian Inference). The language of the paper at times implies otherwise and R3 is right to point this out (e.g.


Mining Unseen Classes via Regional Objectness: A Simple Baseline for Incremental Segmentation

Neural Information Processing Systems

Incremental or continual learning has been extensively studied for image classification tasks to alleviate catastrophic forgetting, a phenomenon in which earlier learned knowledge is forgotten when learning new concepts. For class incremental semantic segmentation, such a phenomenon often becomes much worse due to the semantic shift of the background class, \ie, some concepts learned at previous stages are assigned to the background class at the current training stage, therefore, significantly reducing the performance of these old concepts. To address this issue, we propose a simple yet effective method in this paper, named Mining unseen Classes via Regional Objectness (MicroSeg). Our MicroSeg is based on the assumption that \emph{background regions with strong objectness possibly belong to those concepts in the historical or future stages}. Therefore, to avoid forgetting old knowledge at the current training stage, our MicroSeg first splits the given image into hundreds of segment proposals with a proposal generator.


A Simple Baseline for Bayesian Uncertainty in Deep Learning

Neural Information Processing Systems

We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian using the SWA solution as the first moment and a low rank plus diagonal covariance also derived from the SGD iterates, forming an approximate posterior distribution over neural network weights; we then sample from this Gaussian distribution to perform Bayesian model averaging. We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including variational inference, MC dropout, KFAC Laplace, and temperature scaling.