A Appendix

A.1 Additional Method Justification

The key idea of Q
Since our objective in SLRL is to complete the task as quickly as possible, and we may not be given expert demonstrations as prior data, we want to match the agent's state-action pairs to those that lead to task completion. This problem has been studied in stochastic optimal control, notably in REPS [Peters et al., 2010].

A.2 Implementation Details and Hyperparameters

In our experiments, we use soft actor-critic (SAC) [Haarnoja et al., 2018] as our base RL algorithm with default hyperparameter values: a learning rate of 3e-4 for all networks, optimized with Adam; a batch size of 256 sampled from the entire replay buffer (both prior and online data); and a discount factor of 0.99. The policy and critic networks are MLPs with two fully-connected hidden layers of 256 units each.
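As a concrete illustration of this configuration, the following is a minimal sketch of the network and optimizer setup described above, assuming PyTorch. The module names, placeholder dimensions, and Gaussian-policy parameterization are illustrative assumptions, not taken from the authors' released code.

# Sketch of the SAC network/optimizer configuration described above (assumed PyTorch).
import torch
import torch.nn as nn

HIDDEN_SIZE = 256      # two fully-connected hidden layers of size 256
LEARNING_RATE = 3e-4   # shared learning rate for all networks
BATCH_SIZE = 256       # sampled from the combined prior + online replay buffer
DISCOUNT = 0.99        # discount factor

def mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    """Two-hidden-layer MLP used for both the policy and the critic."""
    return nn.Sequential(
        nn.Linear(in_dim, HIDDEN_SIZE), nn.ReLU(),
        nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE), nn.ReLU(),
        nn.Linear(HIDDEN_SIZE, out_dim),
    )

obs_dim, act_dim = 17, 6  # placeholder dimensions; in practice set from the environment

# Policy head outputs mean and log-std of a Gaussian over actions (standard SAC).
policy = mlp(obs_dim, 2 * act_dim)
# Q-critic takes a concatenated (state, action) pair and outputs a scalar value.
critic = mlp(obs_dim + act_dim, 1)

policy_opt = torch.optim.Adam(policy.parameters(), lr=LEARNING_RATE)
critic_opt = torch.optim.Adam(critic.parameters(), lr=LEARNING_RATE)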