Export Reviews, Discussions, Author Feedback and Meta-Reviews
–Neural Information Processing Systems
The paper presents a new stochastic value gradient algorithm that can combine learning a system dynamics model with learning a (state-action) value function to obtain accurate gradients for the policy update. Value gradient algorithms could so far be used only for deterministic environments and deterministic policies. They extend the formulation to the stochastic case by introducing the noise as additional variables that are supposed to be known for the trajectories (and hence, deterministic). Furthermore, they propose to use a learned model (e.g. a neural network) to compute the gradient from multi-step predictions. Yet, the model is used only for computing the gradient and not for the prediction itself (real roleouts are used), which makes the approach less reliant on the model accuracy (in contrast to other model-based RL methods such as PILCO that suffer from model errors quite severly).
Neural Information Processing Systems
Feb-6-2025, 10:12:07 GMT
- Technology: