
The control objective function of the auxiliary control system is quadratic; its stage term is

$$\frac{1}{2}\begin{bmatrix} X_t \\ U_t \end{bmatrix}^{\top}\begin{bmatrix} H_{xx}^t & H_{xu}^t \\ H_{ux}^t & H_{uu}^t \end{bmatrix}\begin{bmatrix} X_t \\ U_t \end{bmatrix},$$

where $H_{xx}^t$, $H_{xu}^t$, $H_{ux}^t$, and $H_{uu}^t$ are the second-order derivative blocks of the Hamiltonian evaluated along the system trajectory $\xi_\theta$.
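As a minimal illustration of how such second-order blocks could be assembled with automatic differentiation (the toy Hamiltonian, its stage cost, dynamics, and dimensions below are assumptions for illustration, not the paper's implementation):

```python
import jax
import jax.numpy as jnp

# Toy Hamiltonian H(x, u, lam) = c(x, u) + lam^T f(x, u). The stage cost,
# dynamics, and dimensions here are illustrative assumptions only.
def hamiltonian(x, u, lam):
    stage_cost = jnp.sum(x ** 2) + 0.1 * jnp.sum(u ** 2)  # c(x, u)
    next_state = x + 0.1 * u                              # f(x, u)
    return stage_cost + lam @ next_state

# Second-order blocks of H via automatic differentiation.
H_xx = jax.hessian(hamiltonian, argnums=0)                      # d2H/dx2
H_uu = jax.hessian(hamiltonian, argnums=1)                      # d2H/du2
H_xu = jax.jacfwd(jax.grad(hamiltonian, argnums=0), argnums=1)  # d2H/dxdu

x, u, lam = jnp.ones(2), jnp.ones(2), jnp.ones(2)
print(H_xx(x, u, lam))  # 2*I for this toy cost
print(H_xu(x, u, lam))  # zero matrix for this toy H; H_ux is its transpose
```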


Based on Lemma 5.1 and its proof, we know that the PMP of the auxiliary control system, (S.2), is exactly the differential PMP equations (13). Thus, below we only look at the differential PMP equations in (S.2).

In the system identification experiment, we collect a total of five trajectories from systems (in Table 2) with known dynamics, where different trajectories $\xi^o = \{x^o_{0:T}, u_{0:T-1}\}$ have different initial conditions $x_0$ and horizons $T$ ($T$ ranges from 10 to 20), with random inputs $u_{0:T-1}$ drawn from a uniform distribution.

In fact, throughout the entire learning process, PDP always guarantees that the policy constraint is perfectly respected (as the forward pass strictly follows the policy). Please see Appendix Fig. S4 for validation.
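A minimal sketch of this data-collection step, assuming a generic discrete-time step function f(x, u); the toy dynamics, state dimension, and input range [-1, 1] are illustrative assumptions (the text specifies only that inputs are drawn from a uniform distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u):
    # Stand-in discrete-time dynamics x_{t+1} = f(x_t, u_t); in the actual
    # experiment the known dynamics of the systems in Table 2 are used.
    return x + 0.1 * u

def collect_trajectory(x0, T):
    """Roll the system forward T steps under uniformly random inputs."""
    xs, us = [x0], []
    for _ in range(T):
        u = rng.uniform(-1.0, 1.0, size=x0.shape)   # random input u_t
        us.append(u)
        xs.append(f(xs[-1], u))
    return np.stack(xs), np.stack(us)               # x_{0:T}, u_{0:T-1}

# Five trajectories with different initial conditions x_0 and horizons T in [10, 20].
trajectories = [collect_trajectory(rng.normal(size=2), int(rng.integers(10, 21)))
                for _ in range(5)]
```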



To prove Lemma 5.1, we just need to show that the Pontryagin's Maximum Principle for the auxiliary control system $\bar{\Sigma}(\xi_\theta)$ is the differential PMP in (13). The derivation uses (S.3) and the following matrix trace properties: $\operatorname{Tr}(A) = \operatorname{Tr}(A^{\top})$, … Since the PMP equations (S.2) obtained above are the same as the differential PMP in (13), the claim follows.

From (S.2c), we solve for $U_t$. Proof by induction: (S.2d) shows that (S.8) holds for $t = T$ (the base case); … This completes the proof. (A generic sketch of this backward recursion is given at the end of this section.)

D Algorithm Details for Different Learning Modes

One can first learn the dynamics in SysID Mode, then use the learned dynamics as the initial guess in IRL/IOC Mode. In the design of the quadrotor's control objective function, to achieve SE(3) maneuvering, …

In Fig. S1, we show more detailed results of imitation loss versus iteration. In Fig. S2, we show more detailed results of SysID loss versus iteration. In Fig. S5, we use the …

On the cart-pole and robot-arm systems (in Figure 1a and Figure 1b), we learn a feedback policy by minimizing given control objective functions. In Fig. S3, we show the detailed results of control loss (i.e., the value of the control objective function) versus iteration. For the results in Fig. S3 and Fig. S6, we have the following remarks: … This can be seen in Fig. S3 and Fig. S6 (in Fig. S6, PDP results in a simulated trajectory that is closer to the optimal one than that of GPS). This explains why PDP outperforms GPS in terms of having a lower control cost (loss).
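The induction referenced in the proof above has the standard finite-horizon LQR structure: the costate remains linear in the state, with the linear coefficient propagated backward from the terminal condition. Below is a generic, minimal sketch of such a backward pass; the function name and the matrices A, B, Q, R, QT are stand-ins for the auxiliary system's coefficient matrices, not the paper's solver or notation.

```python
import numpy as np

def lqr_backward(A, B, Q, R, QT, T):
    """Generic finite-horizon discrete-time LQR backward pass.

    Mirrors the induction pattern: the costate coefficient P_t is propagated
    backward from the terminal condition P_T = QT, and each U_t = -K_t X_t.
    """
    P = QT
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # feedback gain K_t
        P = Q + A.T @ P @ (A - B @ K)                      # Riccati recursion
        gains.append(K)
    return gains[::-1]  # K_0, ..., K_{T-1}

# Toy double-integrator example.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
gains = lqr_backward(A, B, np.eye(2), 0.1 * np.eye(1), np.eye(2), T=10)
```

In the general non-homogeneous case, the costate also carries an affine term; this sketch keeps only the quadratic part of the recursion.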