
Trajectory Optimization

DiffTORI: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning
Weikang Wan

Neural Information Processing Systems

This paper introduces DiffTORI, which utilizes Differentiable Trajectory Optimization as the policy representation to generate actions for deep Reinforcement and Imitation learning. Trajectory optimization is a powerful and widely used algorithm in control, parameterized by a cost and a dynamics function.

85ea6fd7a2ca3960d0cf5201933ac998-Supplemental.pdf

Neural Information Processing Systems

The proof for Lemma A.1 can be found in Theorem 4 in [48]. A similar proof procedure also applies to prove the second inequality in (S.34). Proof by contradiction: suppose that the above second-order condition in (S.42)-(S.43) is false. Thus, in the following, we will ignore the computation process for obtaining ξ(θ,γ) and simply view ξ(θ,γ) as a twice continuously differentiable function of (θ,γ) near (θ,0), with ξθ = ξ(θ=θ, γ=0). Look at (S.62) and note that when = 0 and γ = 0, (S.62) is identical to the first two equations in (S.59).


6af779991368999ab3da0d366c208fba-Paper-Conference.pdf

Neural Information Processing Systems

Planning enables autonomous agents to solve complex decision-making problems by evaluating predictions of the future. However, classical planning algorithms often become infeasible in real-world settings where state spaces are high-dimensional and transition dynamics unknown.


DiffTORI: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning

Neural Information Processing Systems

Trajectory optimization is a powerful and widely used algorithm in control, parameterized by a cost and a dynamics function. The key to our approach is to leverage the recent progress in differentiable trajectory optimization, which enables computing the gradients of the loss with respect to the parameters of trajectory optimization. As a result, the cost and dynamics functions of trajectory optimization can be learned end-to-end. DiffTORI addresses the "objective mismatch" issue of prior model-based RL algorithms, as the dynamics model in DiffTORI is learned to directly maximize task performance by differentiating the policy gradient loss through the trajectory optimization process. We further benchmark DiffTORI for imitation learning on standard robotic manipulation task suites with high-dimensional sensory observations, and compare our method to feedforward policy classes as well as Energy-Based Models (EBMs) and Diffusion. Across 15 model-based RL tasks and 35 imitation learning tasks with high-dimensional image and point cloud inputs, DiffTORI outperforms prior state-of-the-art methods in both domains.
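To make the idea of "trajectory optimization as a policy, parameterized by a cost and a dynamics function" concrete, below is a minimal sketch of the inner optimization loop. It is not DiffTORI itself: it assumes toy linear dynamics and a quadratic cost, and all names (`rollout`, `traj_cost`, `optimize_actions`) are illustrative. It uses a numerical gradient for clarity; in DiffTORI this inner loop is implemented with a differentiable optimizer so that gradients of the outer RL/IL loss can flow back into the learned cost and dynamics parameters.

```python
import numpy as np

def rollout(x0, actions, A, B):
    """Unroll the (here: linear) dynamics x_{t+1} = A x_t + B u_t."""
    xs = [x0]
    for u in actions:
        xs.append(A @ xs[-1] + B @ u)
    return xs

def traj_cost(x0, actions, A, B, Q, R):
    """Quadratic trajectory cost: sum_t x_t^T Q x_t + u_t^T R u_t."""
    xs = rollout(x0, actions, A, B)
    return sum(x @ Q @ x for x in xs) + sum(u @ R @ u for u in actions)

def optimize_actions(x0, A, B, Q, R, horizon=5, iters=200, lr=0.05, eps=1e-5):
    """Inner trajectory optimization: gradient descent on the action
    sequence. The gradient is estimated by finite differences here purely
    for readability; a differentiable implementation would let an outer
    learner update A, B, Q, R end-to-end through this loop."""
    u = np.zeros((horizon, B.shape[1]))
    for _ in range(iters):
        base = traj_cost(x0, u, A, B, Q, R)
        g = np.zeros_like(u)
        for i in range(u.size):
            up = u.copy()
            up.flat[i] += eps
            g.flat[i] = (traj_cost(x0, up, A, B, Q, R) - base) / eps
        u -= lr * g
    return u

# Toy double-integrator: position/velocity state, scalar force input.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), 0.01 * np.eye(1)
x0 = np.array([1.0, 0.0])

u = optimize_actions(x0, A, B, Q, R)
# The first action u[0] is what the "policy" would execute at this state.
```

The point of the sketch is the structure, not the solver: the policy's output is the argmin of a cost under a dynamics model, and making that argmin differentiable is what lets both functions be trained against the downstream task loss.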