online policy
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- (4 more...)
- Asia > India > West Bengal > Kolkata (0.04)
- Asia > India > Maharashtra > Mumbai (0.04)
- Education (0.46)
- Banking & Finance (0.46)
- Information Technology (0.46)
- Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- (2 more...)
SupplementaryMaterial
Letπ0( |s)beaGaussianbehavioral reference policy with meanµ0(s) and variance σ20(s), and let π( |s) be an online policy with reparameterization at = fφ( t;st)andrandomvector t. Whilstentropyregularization partially mitigates the collapse of predictive variance away from the expert demonstrations, we still observe the wrong trend similar to Figure 1 with predictive variances high near the expert demonstrations andlowonunseen data. AWAC performs online fine-tuning of a policy pre-trained on offline. Themethod requires additional off-policy data to be generated to saturate the replay buffer, thereby requiring ahidden number ofenvironment interactions that donotinvolvelearning. To mitigate this, in practice, BRAC adds an entropy bonus to the supervised learning objective which stabilizes the variance around the training set but has no guarantees away from thedata.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.05)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > India > West Bengal > Kolkata (0.04)
- Asia > India > Maharashtra > Mumbai (0.04)
- Education (0.46)
- Banking & Finance (0.46)
- Information Technology (0.46)
- Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- (2 more...)
- North America > United States > New York > New York County > New York City (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- (5 more...)
Supplementary Material T able of Contents
A Laplace behavioral reference policy may be able to mitigate some of the problems posed by Proposition 1 due to the heavy tails of the distribution. Tikhonov regularization does not resolve the issue with calibration of uncertainties. A W AC performs online fine-tuning of a policy pre-trained on offline. BRAC regularizes the online policy against an offline behavioral policy as our method does. DAPG incorporates offline data into policy gradients by initially pre-training with a behaviorally cloned policy and then augmenting the RL loss with a supervised-learning loss.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)