Gradient Descent

Globally Convergent Policy Search for Output Estimation

Neural Information Processing Systems

We introduce the first direct policy search algorithm that provably converges to the globally optimal dynamic filter for the classical problem of predicting the outputs of a linear dynamical system, given noisy, partial observations. Despite the ubiquity of partial observability in practice, theoretical guarantees for direct policy search algorithms, one of the backbones of modern reinforcement learning, have proven difficult to achieve. This is primarily due to the degeneracies that arise when optimizing over filters that maintain an internal state. In this paper, we provide a new perspective on this challenging problem based on the notion of informativity, which intuitively requires that all components of a filter's internal state are representative of the true state of the underlying dynamical system. We show that informativity overcomes the aforementioned degeneracy. Specifically, we propose a regularizer which explicitly enforces informativity, and establish that gradient descent on this regularized objective, combined with a "reconditioning step", converges to the globally optimal cost at an O(1/T) rate.
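
To make the setting concrete, the following is a minimal toy sketch of direct policy search over a filter with an internal state, on a scalar linear system. It is not the paper's algorithm: it omits the informativity regularizer and the reconditioning step, and it uses a finite-difference gradient with a backtracking step size, which are assumptions made only for illustration. All names (`simulate`, `cost`, `grad`, `policy_search`) and the system parameters are hypothetical.

```python
import random

def simulate(a, c, T, seed=0):
    """Simulate x_{t+1} = a*x_t + w_t, y_t = c*x_t + v_t with unit-variance Gaussian noise."""
    rng = random.Random(seed)
    x, ys = 0.0, []
    for _ in range(T):
        ys.append(c * x + rng.gauss(0.0, 1.0))
        x = a * x + rng.gauss(0.0, 1.0)
    return ys

def cost(params, ys):
    """Mean squared one-step prediction error of the filter
    xhat_{t+1} = a_f*xhat_t + b_f*y_t, yhat_t = c_f*xhat_t."""
    a_f, b_f, c_f = params
    xhat, total = 0.0, 0.0
    for y in ys:
        err = y - c_f * xhat          # predict y_t before seeing it
        total += err * err
        xhat = a_f * xhat + b_f * y   # update the filter's internal state
    return total / len(ys)

def grad(params, ys, eps=1e-6):
    """Central finite-difference gradient of the cost (illustrative assumption)."""
    g = []
    for i in range(len(params)):
        hi = list(params); hi[i] += eps
        lo = list(params); lo[i] -= eps
        g.append((cost(hi, ys) - cost(lo, ys)) / (2 * eps))
    return g

def policy_search(params, ys, iters=200, step=0.1):
    """Plain gradient descent; the step is halved until the cost strictly decreases."""
    params = list(params)
    current = cost(params, ys)
    for _ in range(iters):
        g = grad(params, ys)
        s = step
        while s > 1e-12:
            trial = [p - s * gi for p, gi in zip(params, g)]
            trial_cost = cost(trial, ys)
            if trial_cost < current:
                params, current = trial, trial_cost
                break
            s *= 0.5
    return params, current

# Hypothetical system and initial filter parameters, chosen only for the demo.
ys = simulate(a=0.9, c=1.0, T=500)
init = [0.5, 0.1, 1.0]
initial_cost = cost(init, ys)
learned, final_cost = policy_search(init, ys)
```

Comparing `final_cost` with `initial_cost` shows how much the search improved the filter on this one simulated trajectory. The sketch carries no convergence guarantee; the abstract's global guarantee relies on the regularizer and reconditioning step that are left out here.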


Supplementary Material For Stochastic Multiple Target Sampling Gradient Descent

Neural Information Processing Systems

This supplementary material consists of the following sections: Appendix 1 contains the proofs and derivations of our theory development. ... As a consequence, we obtain the conclusion of Equation (1). By choosing u to be a one-hot vector at i, we obtain the conclusion of Lemma 1.

1.3 Derivations for the matrix U's formulation in Equation (3)

We have ϕ ... As a consequence, we obtain the conclusion of Equation (3).

1.4 Proof of Theorem 2

Before proving this theorem, let us re-state it: We have for all i = 1,...,K that D ...

In this experiment, the three target distributions are created as presented in the main paper. Results are averaged over 5 runs. We take the best checkpoint in each approach based on the validation score.