A Appendix

Neural Information Processing Systems 

Numbers in bold denote a statistically significant difference between the two methods (p-value < 0.001, paired t-test). We also list the IID (Table T6) and OOD (Tables T7, T8 and T9) test results of all the agents trained for this work. Some negative values should not surprise the reader: some agents, when tested far outside of the training distribution, fail to walk and collect more penalties (e.g., due to undesired contact forces or excessive energy expenditure) than positive reward. We also show the reward as a function of perturbation intensity for the end-to-end trained Oracle, DMAP and TCN (Figure F2). Generally, DMAP performs similarly to the Oracle, while the TCN has lower performance, especially for the more challenging morphologies (Ant, Walker).
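For reference, the paired t-test behind the bold entries can be written in its standard textbook form. The symbols below are illustrative and not taken from the paper: x_i and y_i stand for the rewards of the two compared methods on the i-th matched evaluation, and n is the number of such pairs. What exactly is paired (e.g., seeds or test conditions) is an assumption here, as this appendix does not state it.

```latex
% Standard paired t-test; x_i, y_i, n are illustrative symbols (assumed pairing).
\[
d_i = x_i - y_i, \qquad
\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2},
\]
\[
t = \frac{\bar{d}}{s_d / \sqrt{n}}
\]
```

Under the null hypothesis of zero mean difference, t follows a Student's t-distribution with n - 1 degrees of freedom, and an entry would be marked in bold when the resulting two-sided p-value falls below 0.001.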