Observer-Feedback-Feedforward Controller Structures in Reinforcement Learning

arXiv.org Artificial Intelligence

The paper proposes structured neural networks for reinforcement-learning-based nonlinear adaptive control. The focus is on partially observable systems, with separate neural networks for the state and feedforward observer and for the state feedback and feedforward controller. The observer dynamics are modelled by recurrent neural networks, while a standard network is used for the controller. As discussed in the paper, this confines the observer dynamics to the recurrent neural network part and the state feedback to the feedback and feedforward network. The structured approach reduces the computational complexity and gives the reinforcement-learning-based controller an understandable structure compared with using a single neural network. As shown by simulation, the proposed structure has the additional, and main, advantage that training becomes significantly faster. Two ways to include feedforward structure are presented, one related to state feedback control and one related to classical feedforward control. The latter method introduces further structure with a separate recurrent neural network that processes only the measured disturbance. When evaluated by simulation on a nonlinear cascaded double tank process, the method with the most structure performs best, with excellent gains in feedforward disturbance rejection.
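
The structure described in this abstract can be illustrated with a minimal, untrained sketch. It is not the paper's implementation: the Elman-style recurrent cell, the two-layer controller network, the layer sizes, the choice of observer inputs (measurement and previous control), and every name below are assumptions made purely for illustration. The point it shows is the separation itself: one recurrent network acts as the state observer, a second recurrent network sees only the measured disturbance, and a static network maps their outputs and a reference to the control signal.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def init_rnn(n_in, n_hidden):
    # Hypothetical Elman-style cell: h_next = tanh(W_in x + W_h h + b)
    return {
        "W_in": 0.1 * rng.standard_normal((n_hidden, n_in)),
        "W_h": 0.1 * rng.standard_normal((n_hidden, n_hidden)),
        "b": np.zeros(n_hidden),
    }


def rnn_step(p, h, x):
    return np.tanh(p["W_in"] @ x + p["W_h"] @ h + p["b"])


def init_mlp(n_in, n_hidden, n_out):
    # Hypothetical static (non-recurrent) controller network
    return {
        "W1": 0.1 * rng.standard_normal((n_hidden, n_in)),
        "b1": np.zeros(n_hidden),
        "W2": 0.1 * rng.standard_normal((n_out, n_hidden)),
        "b2": np.zeros(n_out),
    }


def mlp(p, z):
    return p["W2"] @ np.tanh(p["W1"] @ z + p["b1"]) + p["b2"]


class StructuredController:
    """Observer RNN + disturbance RNN + static controller (all sizes assumed)."""

    def __init__(self, n_state=4, n_dist=2):
        self.obs = init_rnn(2, n_state)  # inputs: measurement y, previous control u
        self.ff = init_rnn(1, n_dist)    # input: measured disturbance d only
        self.ctrl = init_mlp(n_state + n_dist + 1, 8, 1)
        self.h_obs = np.zeros(n_state)
        self.h_ff = np.zeros(n_dist)
        self.u_prev = 0.0

    def step(self, y, d, r):
        # The observer state is updated without ever seeing the disturbance d
        self.h_obs = rnn_step(self.obs, self.h_obs, np.array([y, self.u_prev]))
        # The feedforward path is updated from the disturbance alone
        self.h_ff = rnn_step(self.ff, self.h_ff, np.array([d]))
        # The static controller combines both estimates with the reference r
        z = np.concatenate([self.h_obs, self.h_ff, [r]])
        self.u_prev = float(mlp(self.ctrl, z)[0])
        return self.u_prev


controller = StructuredController()
u = controller.step(y=0.5, d=0.1, r=1.0)
```

In a reinforcement learning setting the weights of all three networks would be trained from a reward signal; the sketch only fixes the wiring between them.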


Optimal Dynamic Regret in Proper Online Learning with Strongly Convex Losses and Beyond

arXiv.org Machine Learning

We study the framework of universal dynamic regret minimization with strongly convex losses. We answer an open problem in Baby and Wang (2021) by showing that, in a proper learning setup, Strongly Adaptive algorithms can achieve the near-optimal dynamic regret of $\tilde O(d^{1/3} n^{1/3}\text{TV}[u_{1:n}]^{2/3} \vee d)$ against any comparator sequence $u_1,\ldots,u_n$ simultaneously, where $n$ is the time horizon and $\text{TV}[u_{1:n}]$ is the total variation of the comparator sequence. These results are facilitated by exploiting a number of new structures imposed by the KKT conditions that were not considered in Baby and Wang (2021), and which also lead to other improvements over their results, such as (a) handling non-smooth losses and (b) improving the dimension dependence of the regret. Further, we derive near-optimal dynamic regret rates for the special case of proper online learning with exp-concave losses and an $L_\infty$ constrained decision set.
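
To make the quantities in the bound concrete, the following sketch computes the total variation of a comparator sequence and evaluates the rate $d^{1/3} n^{1/3}\text{TV}[u_{1:n}]^{2/3} \vee d$ with constants and logarithmic factors dropped. The abstract does not state which norm defines the total variation, so the L1 norm of successive differences used here is an assumption, and the function names are hypothetical.

```python
import numpy as np


def total_variation(u):
    # Sum over t of ||u_t - u_{t-1}||_1 (the L1 norm is an assumption here)
    u = np.asarray(u, dtype=float)
    if u.ndim == 1:
        u = u[:, None]  # treat a flat sequence as n points in dimension 1
    return float(np.abs(np.diff(u, axis=0)).sum())


def regret_rate(n, d, tv):
    # d^(1/3) * n^(1/3) * TV^(2/3), or d if that is larger ("vee" = maximum);
    # constants and log factors hidden by the O-tilde notation are ignored
    return max(d ** (1 / 3) * n ** (1 / 3) * tv ** (2 / 3), d)


comparators = [0.0, 1.0, 1.0, 3.0]  # a scalar comparator sequence with n = 4
tv = total_variation(comparators)
rate = regret_rate(n=len(comparators), d=1, tv=tv)
```

For a comparator sequence that never changes, the total variation is zero and the rate reduces to $d$, which is the role of the maximum in the bound.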