Uncertainty-based Continual Learning with Adaptive Regularization

Neural Information Processing Systems

We introduce a new neural network-based continual learning algorithm, dubbed Uncertainty-regularized Continual Learning (UCL), which builds on the traditional Bayesian online learning framework with variational inference. We focus on two significant drawbacks of recently proposed regularization-based methods: a) the considerable additional memory cost of determining the per-weight regularization strengths, and b) the absence of a graceful forgetting scheme, which can prevent performance degradation when learning new tasks. In this paper, we show that UCL solves these two problems by introducing a fresh interpretation of the Kullback-Leibler (KL) divergence term of the variational lower bound for Gaussian mean-field approximation. Based on this interpretation, we propose the notion of node-wise uncertainty, which drastically reduces the number of additional parameters needed to implement per-weight regularization. Moreover, we devise two additional regularization terms that enforce \emph{stability} by freezing important parameters for past tasks and allow \emph{plasticity} by controlling the actively learning parameters for a new task. Through extensive experiments, we show that UCL convincingly outperforms most recent state-of-the-art baselines not only on popular supervised learning benchmarks but also on challenging lifelong reinforcement learning tasks. The source code of our algorithm is available at https://github.com/csm9493/UCL.
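As a rough illustration of the node-wise idea (a sketch, not the authors' implementation; all names and values below are illustrative), the per-weight KL term of the Gaussian mean-field bound can be computed with one variance per weight, as in VCL, or with one shared variance per node, which shrinks the number of stored uncertainty parameters from the number of weights to the number of nodes:

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Elementwise KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) )."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2) - 0.5)

n_in, n_out = 4, 3
mu = np.random.randn(n_in, n_out)

# Per-weight uncertainty (as in VCL): one variance per weight.
sigma_per_weight = np.full((n_in, n_out), 0.1)   # n_in * n_out extra params

# Node-wise uncertainty: all weights entering a node share one variance,
# so only n_out extra parameters are stored.
sigma_per_node = np.full((1, n_out), 0.1)        # broadcasts over the fan-in
sigma_shared = np.broadcast_to(sigma_per_node, (n_in, n_out))

# KL against a standard-normal prior, using the shared node-wise variances.
kl = gaussian_kl(mu, sigma_shared, np.zeros_like(mu), np.ones_like(mu)).sum()
print(sigma_per_weight.size, "vs", sigma_per_node.size)  # 12 vs 3
```

The saving compounds in wide layers: a layer with n_in fan-in stores n_in times fewer uncertainty parameters under the node-wise scheme.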


Learning Neural Networks with Adaptive Regularization

Neural Information Processing Systems

Feed-forward neural networks can be understood as a combination of an intermediate representation and a linear hypothesis. While most previous works aim to diversify the representations, we explore the complementary direction by performing an adaptive and data-dependent regularization motivated by the empirical Bayes method. Specifically, we propose to construct a matrix-variate normal prior (on weights) whose covariance matrix has a Kronecker product structure. This structure is designed to capture the correlations in neurons through backpropagation. Under the assumption of this Kronecker factorization, the prior encourages neurons to borrow statistical strength from one another.
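A minimal sketch of the Kronecker-structured covariance described above (the matrix names `U` and `V` and the sizes are illustrative, not the paper's notation):

```python
import numpy as np

# Matrix-variate normal prior on a weight matrix W (n_in x n_out):
# vec(W) ~ N(0, V kron U), where U models correlations among input-side
# neurons and V among output-side neurons.
n_in, n_out = 3, 2
U = np.eye(n_in) + 0.1 * np.ones((n_in, n_in))     # row covariance (PSD)
V = np.eye(n_out) + 0.1 * np.ones((n_out, n_out))  # column covariance (PSD)

# The full covariance implied by the Kronecker factorization.
full_cov = np.kron(V, U)   # shape: (n_in*n_out) x (n_in*n_out)

# The factorization stores n_in**2 + n_out**2 numbers instead of
# (n_in*n_out)**2, while still coupling every pair of weights.
print(full_cov.shape)  # (6, 6)
```

Coupling every pair of weights through only two small factor matrices is what lets correlated neurons "borrow statistical strength" without a quadratic blow-up in prior parameters.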


Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems

Neural Information Processing Systems

Reinforcement Learning (RL) is a powerful method for controlling dynamic systems, but its learning mechanism can lead to unpredictable actions that undermine the safety of critical systems. Here, we propose RL with Adaptive Regularization (RL-AR), an algorithm that enables safe RL exploration by combining the RL policy with a policy regularizer that hard-codes the safety constraints. RL-AR performs policy combination via a focus module, which determines the appropriate combination depending on the state--relying more on the safe policy regularizer for less-exploited states while allowing unbiased convergence for well-exploited states. In a series of critical control applications, we demonstrate that RL-AR not only ensures safety during training but also achieves a return competitive with the standards of model-free RL that disregards safety.
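The state-dependent combination can be caricatured as a convex blend whose weight grows with how well a state has been exploited; the visit-count-based focus below is a hypothetical stand-in for the learned focus module, not the paper's method:

```python
def focus_weight(visit_count, k=1.0):
    """Hypothetical focus: trust the learned RL policy more as a state
    is visited more often; k is an illustrative smoothing constant."""
    return visit_count / (visit_count + k)   # in [0, 1)

def combined_action(a_safe, a_rl, visit_count):
    """Blend the safe regularizer's action with the RL policy's action."""
    w = focus_weight(visit_count)
    return (1.0 - w) * a_safe + w * a_rl

# Rarely visited state -> the action stays at the safe policy's choice.
print(combined_action(0.0, 1.0, visit_count=0))    # 0.0
# Well-exploited state -> the action approaches the unconstrained RL policy.
print(combined_action(0.0, 1.0, visit_count=99))   # 0.99
```

Because the blend weight never quite reaches 1 for finite experience yet tends to 1 in the limit, early exploration stays anchored to the safe regularizer while the combined policy can still converge without bias.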


Dropout Training as Adaptive Regularization

Neural Information Processing Systems

Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learner, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer.
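A sketch of the first-order equivalence for logistic regression, with dropout rate `delta` and the diagonal Fisher information estimated from the data (illustrative, assuming the quadratic approximation; not the paper's exact derivation):

```python
import numpy as np

def dropout_penalty(X, w, delta=0.5):
    """Quadratic approximation of the dropout regularizer for logistic
    regression: an adaptive L2 penalty in which each squared weight is
    scaled by the corresponding diagonal entry of the Fisher information.
    `delta` is the dropout rate (illustrative default)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))                  # model probabilities
    fisher_diag = ((p * (1.0 - p))[:, None] * X**2).sum(axis=0)
    return 0.5 * (delta / (1.0 - delta)) * np.sum(fisher_diag * w**2)

X = np.array([[1.0, 0.5], [0.2, 1.0], [1.0, 1.0]])
w = np.array([0.3, -0.1])
penalty = dropout_penalty(X, w)
print(penalty >= 0.0)  # True: the penalty is always non-negative
```

The Fisher scaling is what makes the penalty adaptive: weights tied to features where the model is confident (p near 0 or 1) are penalized less than weights tied to uncertain regions.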


Reviews: Uncertainty-based Continual Learning with Adaptive Regularization

Neural Information Processing Systems

Summary: The paper presents a regularization-based continual learning method, UCL, in which, during training of the current task, the parameters of the network are regularized based on their uncertainty in the previous tasks (lower uncertainty means a parameter is important and should not be altered in future tasks). Instead of measuring uncertainty at the parameter level, as done in earlier works such as Variational Continual Learning (VCL), the authors propose to measure uncertainty over the neurons, resulting in fewer learnable parameters (means and variances) to store. To compute each neuron's uncertainty, UCL imposes the constraint that all the weights going into a neuron share a common variance. To learn the parameters, a variational objective is used, where the authors cleverly decompose the KL term in the ELBO and manipulate it to impose constraints on the variances of different neurons. Results are reported on MNIST benchmarks and RL tasks.


Reviews: Uncertainty-based Continual Learning with Adaptive Regularization

Neural Information Processing Systems

This paper proposes uncertainty-regularized continual learning (UCL) to address the challenge of catastrophic forgetting in neural networks. In detail, the method improves over variational continual learning (VCL) by modifying the KL regularizer in the mean-field Gaussian prior/posterior setting. The approach is justified mainly by intuitive explanation rather than theoretical/mathematical arguments. Experiments are performed on supervised continual learning benchmarks (split and permuted MNIST), and the method shows dominant performance over previous baselines (VCL, SI, EWC, HAT). Reviewers include experts in continual learning.


Reviews: Learning Neural Networks with Adaptive Regularization

Neural Information Processing Systems

Statistical Strength: Throughout the paper, you refer to the concept of 'statistical strength' without describing what it actually means. I expect it means that if two things are correlated, you can estimate their properties with better sample efficiency by taking this correlation into account, since you are effectively getting more data. Given that two features are correlated, optimization will be improved if you do some sort of preconditioning that accounts for this structure. In other words, given that features are correlated, you want to 'share statistical strength.' However, it is less clear to me why you want to regularize the model such that things become correlated/anti-correlated.


Reviews: Learning Neural Networks with Adaptive Regularization

Neural Information Processing Systems

Borderline paper, leaning to accept. All reviewers liked the paper, but even after the rebuttal they retain a minor concern regarding originality.