Supplementary Material Table of Contents
–Neural Information Processing Systems
A Laplace behavioral reference policy may mitigate some of the problems posed by Proposition 1 because of the distribution's heavy tails. Tikhonov regularization does not resolve the issue of uncertainty calibration. AWAC performs online fine-tuning of a policy pre-trained on offline data. BRAC, like our method, regularizes the online policy against an offline behavioral policy. DAPG incorporates offline data into policy gradients by first pre-training with a behaviorally cloned policy and then augmenting the RL loss with a supervised-learning loss.
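The heavy-tail remark can be illustrated with a small numerical sketch. Proposition 1 is not reproduced here, so the code below only compares how quickly the log-density of a Gaussian and of a variance-matched Laplace distribution falls off for actions far from the mean. The function names and the unit-variance setup are assumptions made for illustration, not part of the paper.

```python
import math


def gaussian_log_pdf(x, mean=0.0, std=1.0):
    """Log-density of a univariate Gaussian (hypothetical helper)."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def laplace_log_pdf(x, mean=0.0, scale=1.0):
    """Log-density of a univariate Laplace distribution (hypothetical helper)."""
    return -abs(x - mean) / scale - math.log(2.0 * scale)


# A Laplace distribution has variance 2 * scale**2, so this scale
# matches the variance of a unit-variance Gaussian (an assumed setup).
LAPLACE_SCALE = 1.0 / math.sqrt(2.0)

if __name__ == "__main__":
    for action in (0.0, 1.0, 3.0, 6.0):
        g = gaussian_log_pdf(action)
        l = laplace_log_pdf(action, scale=LAPLACE_SCALE)
        print(f"action={action:4.1f}  gaussian={g:8.3f}  laplace={l:8.3f}")
```

The Gaussian log-density decays quadratically with distance from the mean, while the Laplace log-density decays only linearly, so a Laplace reference policy assigns far less extreme log-probabilities to actions that lie well away from the behavioral mean.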
Aug-18-2025, 16:49:48 GMT