A Hyperparameter Settings of RD
In this section, we describe the hyperparameter settings of RD. For SAC-N-Unc and TD3-N-Unc, M is set to 1/10 of the total training steps. To ensure fairness, all algorithms employing RD are implemented on top of the CORL repository [54]. We derive these backbone algorithms by modifying the original SAC/TD3 algorithms to use a critic ensemble of size N and to incorporate an uncertainty regularization term in the policy update. Additionally, RD with fewer Q-ensembles can achieve similar or even better results than the backbone methods using more Q-ensembles, indicating its potential to reduce computing resource consumption.
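The exact form of the uncertainty regularization term is not spelled out here, but a common choice for critic ensembles is to penalize the policy objective with the standard deviation of the ensemble's Q-values. The toy sketch below (the linear critics and all names are our hypothetical stand-ins, not the paper's implementation) illustrates that mean-minus-std objective for an ensemble of N critics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: N critics, each a linear function of a
# state-action feature vector.
N = 4                                # ensemble size
feat_dim = 8
W = rng.normal(size=(N, feat_dim))   # one weight vector per critic

def q_ensemble(sa_feat):
    """Return the N ensemble Q-values for one state-action feature vector."""
    return W @ sa_feat

def policy_objective(sa_feat, beta=1.0):
    """Uncertainty-regularized policy objective: the ensemble mean
    minus beta times the ensemble standard deviation."""
    qs = q_ensemble(sa_feat)
    return qs.mean() - beta * qs.std()

sa = rng.normal(size=feat_dim)
obj = policy_objective(sa, beta=1.0)  # pessimistic value used for the update
```

With beta = 0 this reduces to the plain ensemble-mean objective; larger beta makes the policy more pessimistic where the critics disagree.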
Appendices
A Sketch of Theoretical Analyses
Theorem B.1 (Performance difference bound for model-based RL). Let ϵ_{M_i} denote the inconsistency between the learned dynamics P_{M_i} and the true dynamics. For the terms L1–L3, with the performance-gap approximation for M_1 and π_1, we apply Lemma C.2. Here, d^π_{M_i} denotes the distribution of state-action pairs induced by policy π under the dynamics model M_i.
Theorem B.3 (Refined bound with constraints). Let µ and ν be two probability distributions on the configuration space X; then, according to Lemma C.1, we have a bound on D_TV(µ, ν). Under these definitions, we can obtain the following intermediate result by applying the results from B.2 and B.1. Here, we take the time-varying linear quadratic regulator as an instance to illustrate the rationality of our assumption on α.
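For reference, the total variation distance D_TV invoked in Theorem B.3 has a simple closed form for discrete distributions: D_TV(µ, ν) = (1/2) Σ_x |µ(x) − ν(x)|. A minimal sketch (the function name is ours, not the paper's):

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance between two discrete distributions:
    D_TV(mu, nu) = (1/2) * sum_x |mu(x) - nu(x)|."""
    mu, nu = np.asarray(mu, dtype=float), np.asarray(nu, dtype=float)
    return 0.5 * np.abs(mu - nu).sum()

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
d = tv_distance(mu, nu)  # 0.5 * (0.3 + 0.0 + 0.3) = 0.3
```

D_TV lies in [0, 1] for probability distributions and is zero exactly when the two distributions coincide.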
Supplementary Material for BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning
Note that ˆφ is feasible for the constrained optimization problem. We refer to it as an "early stopping scheme" because the key idea is to return to the parameter values which gave the lowest validation error (see Section 7.8 of Goodfellow et al. [3]). In our implementation, we initialize two upper envelope networks with parameters φ and φ0, where φ is trained using the penalty loss, and φ0 records the parameters with the lowest validation error encountered so far. If Lφ > Lφ0, we count the number of consecutive times this occurs. Not only is this not standard practice, but to make a fair comparison across all algorithms, this would require, for each of the five algorithms, performing a separate hyperparameter search for each of the five environments.
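The early stopping bookkeeping described above can be sketched as follows. This is a generic patience-based loop, not the BAIL implementation: `step_fn`, `val_loss_fn`, and `patience` are our hypothetical stand-ins for the actual training step, validation loss, and stopping threshold; `best_params` plays the role of φ0.

```python
import copy

def train_with_early_stopping(params, step_fn, val_loss_fn, max_steps, patience):
    """Keep a copy of the parameters with the lowest validation loss seen
    so far; stop after `patience` consecutive steps whose loss exceeds
    the best, and return the best parameters (the role of phi0)."""
    best_params = copy.deepcopy(params)
    best_loss = val_loss_fn(params)
    bad_streak = 0
    for _ in range(max_steps):
        params = step_fn(params)          # one training update on phi
        loss = val_loss_fn(params)
        if loss < best_loss:              # new best: record it, reset streak
            best_loss = loss
            best_params = copy.deepcopy(params)
            bad_streak = 0
        else:                             # L_phi > L_phi0: count occurrences
            bad_streak += 1
            if bad_streak >= patience:
                break
    return best_params, best_loss

# Toy usage: a scalar parameter that steps past the optimum of (x - 1)^2.
bp, bl = train_with_early_stopping(
    0.0, lambda x: x + 0.5, lambda x: (x - 1.0) ** 2,
    max_steps=100, patience=2,
)
```

The loop returns the parameters from the best validation point rather than the final ones, which is the essence of the scheme.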
For our specific algorithm, TD3+BC, given that the performance gain over existing state-of-the-art methods is minimal, it would be surprising to see our paper result in significant impact in these contexts. For CQL, we modify the GitHub defaults for the actor learning rate and use a fixed α rather than the Lagrange variant, matching the hyperparameters defined in their paper (which differ from the GitHub defaults), as we found the original hyperparameters performed better. We can also choose λ by considering the value estimate of the agent: if we see divergence in the value function due to extrapolation error [Fujimoto et al., 2019], then we need to decrease λ so that the BC term is weighted more highly. We use the default hyperparameters in the Fisher-BRC GitHub.
Figure 1: Percent difference in performance of offline RL algorithms when adding normalization to state features.
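The λ trade-off discussed above can be made concrete with a sketch of a TD3+BC-style actor loss, where λ scales the Q term against the behavior-cloning term, so a smaller λ weights BC more heavily. The normalization λ = α / mean|Q| follows the TD3+BC paper; the function name and toy inputs are ours:

```python
import numpy as np

def actor_loss(q_values, pi_actions, data_actions, alpha=2.5):
    """TD3+BC-style actor loss sketch: maximize lambda * Q while staying
    close to the dataset actions via an MSE behavior-cloning term.
    lambda = alpha / mean|Q| keeps the two terms on a comparable scale."""
    lam = alpha / np.abs(q_values).mean()
    bc_term = ((pi_actions - data_actions) ** 2).mean()
    return -(lam * q_values).mean() + bc_term
```

If the value estimates diverge, shrinking α (and hence λ) pulls the policy back toward the dataset actions, which matches the tuning heuristic described above.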