SupplementaryMaterial
–Neural Information Processing Systems
Letπ0( |s)beaGaussianbehavioral reference policy with meanµ0(s) and variance σ20(s), and let π( |s) be an online policy with reparameterization at = fφ( t;st)andrandomvector t. Whilstentropyregularization partially mitigates the collapse of predictive variance away from the expert demonstrations, we still observe the wrong trend similar to Figure 1 with predictive variances high near the expert demonstrations andlowonunseen data. AWAC performs online fine-tuning of a policy pre-trained on offline. Themethod requires additional off-policy data to be generated to saturate the replay buffer, thereby requiring ahidden number ofenvironment interactions that donotinvolvelearning. To mitigate this, in practice, BRAC adds an entropy bonus to the supervised learning objective which stabilizes the variance around the training set but has no guarantees away from thedata.
Neural Information Processing Systems
Feb-11-2026, 19:37:35 GMT
- Technology: