to

### Machine learning approach could improve radar in congested environments - Military Embedded Systems

Research being conducted by the U.S. Army Combat Capabilities Development Command (DEVCOM) is focused on a new machine learning approach that could improve radar performance in congested environments. Researchers from DEVCOM, Army Research Laboratory, and Virginia Tech have developed an automatic way for radars to operate in congested and limited-spectrum environments created by commercial 4G LTE and future 5G communications systems. The researchers claim they examined how future Department of Defense radar systems will share the spectrum with commercial communications systems. The team used machine learning to learn the behavior of ever-changing interference in the spectrum and find clean spectrum to maximize the radar performance. Once clean spectrum is identified, waveforms can be modified to best fit into the spectrum.

### Go with the Flow: Adaptive Control for Neural ODEs

Despite their elegant formulation and lightweight memory cost, neural ordinary differential equations (NODEs) suffer from known representational limitations. In particular, the single flow learned by NODEs cannot express all homeomorphisms from a given data space to itself, and their static weight parametrization restricts the type of functions they can learn compared to discrete architectures with layer-dependent weights. Here, we describe a new module called neurally-controlled ODE (N-CODE) designed to improve the expressivity of NODEs. The parameters of N-CODE modules are dynamic variables governed by a trainable map from initial or current activation state, resulting in forms of open-loop and closed-loop control, respectively. A single module is sufficient for learning a distribution on non-autonomous flows that adaptively drive neural representations. We provide theoretical and empirical evidence that N-CODE circumvents limitations of previous models and show how increased model expressivity manifests in several domains. In supervised learning, we demonstrate that our framework achieves better performance than NODEs as measured by both training speed and testing accuracy. In unsupervised learning, we apply this control perspective to an image Autoencoder endowed with a latent transformation flow, greatly improving representational power over a vanilla model and leading to state-of-the-art image reconstruction on CIFAR-10.

### Complementary Meta-Reinforcement Learning for Fault-Adaptive Control

Faults are endemic to all systems. Adaptive fault-tolerant control maintains degraded performance when faults occur as opposed to unsafe conditions or catastrophic events. In systems with abrupt faults and strict time constraints, it is imperative for control to adapt quickly to system changes to maintain system operations. We present a meta-reinforcement learning approach that quickly adapts its control policy to changing conditions. The approach builds upon model-agnostic meta learning (MAML). The controller maintains a complement of prior policies learned under system faults. This "library" is evaluated on a system after a new fault to initialize the new policy. This contrasts with MAML, where the controller derives intermediate policies anew, sampled from a distribution of similar systems, to initialize a new policy. Our approach improves sample efficiency of the reinforcement learning process. We evaluate our approach on an aircraft fuel transfer system under abrupt faults.

### Explore More and Improve Regret in Linear Quadratic Regulators

Stabilizing the unknown dynamics of a control system and minimizing regret in control of an unknown system are among the main goals in control theory and reinforcement learning. In this work, we pursue both these goals for adaptive control of linear quadratic regulators (LQR). Prior works accomplish either one of these goals at the cost of the other one. The algorithms that are guaranteed to find a stabilizing controller suffer from high regret, whereas algorithms that focus on achieving low regret assume the presence of a stabilizing controller at the early stages of agent-environment interaction. In the absence of such a stabilizing controller, at the early stages, the lack of reasonable model estimates needed for (i) strategic exploration and (ii) design of controllers that stabilize the system, results in regret that scales exponentially in the problem dimensions. We propose a framework for adaptive control that exploits the characteristics of linear dynamical systems and deploys additional exploration in the early stages of agent-environment interaction to guarantee sooner design of stabilizing controllers. We show that for the classes of controllable and stabilizable LQRs, where the latter is a generalization of prior work, these methods achieve $\tilde{\mathcal{O}}(\sqrt{T})$ regret with a polynomial dependence in the problem dimensions.

### Dynamic Bidding Strategies with Multivariate Feedback Control for Multiple Goals in Display Advertising

Real-Time Bidding (RTB) display advertising is a method for purchasing display advertising inventory in auctions that occur within milliseconds. The performance of RTB campaigns is generally measured with a series of Key Performance Indicators (KPIs) - measurements used to ensure that the campaign is cost-effective and that it is purchasing valuable inventory. While an RTB campaign should ideally meet all KPIs, simultaneous improvement tends to be very challenging, as an improvement to any one KPI risks a detrimental effect toward the others. Here we present an approach to simultaneously controlling multiple KPIs with a PID-based feedback-control system. This method generates a control score for each KPI, based on both the output of a PID controller module and a metric that quantifies the importance of each KPI for internal business needs. On regular intervals, this algorithm - Sequential Control - will choose the KPI with the greatest overall need for improvement. In this way, our algorithm is able to continually seek the greatest marginal improvements to its current state. Multiple methods of control can be associated with each KPI, and can be triggered either simultaneously or chosen stochastically, in order to avoid local optima. In both offline ad bidding simulations and testing on live traffic, our methods proved to be effective in simultaneously controlling multiple KPIs, and bringing them toward their respective goals.

### Technical Report: Adaptive Control for Linearizable Systems Using On-Policy Reinforcement Learning

This paper proposes a framework for adaptively learning a feedback linearization-based tracking controller for an unknown system using discrete-time model-free policy-gradient parameter update rules. The primary advantage of the scheme over standard model-reference adaptive control techniques is that it does not require the learned inverse model to be invertible at all instances of time. This enables the use of general function approximators to approximate the linearizing controller for the system without having to worry about singularities. However, the discrete-time and stochastic nature of these algorithms precludes the direct application of standard machinery from the adaptive control literature to provide deterministic stability proofs for the system. Nevertheless, we leverage these techniques alongside tools from the stochastic approximation literature to demonstrate that with high probability the tracking and parameter errors concentrate near zero when a certain persistence of excitation condition is satisfied. A simulated example of a double pendulum demonstrates the utility of the proposed theory. 1 I. INTRODUCTION Many real-world control systems display nonlinear behaviors which are difficult to model, necessitating the use of control architectures which can adapt to the unknown dynamics online while maintaining certificates of stability. There are many successful model-based strategies for adaptively constructing controllers for uncertain systems [1], [2], [3], but these methods often require the presence of a simple, reasonably accurate parametric model of the system dynamics. Recently, however, there has been a resurgence of interest in the use of model-free reinforcement learning techniques to construct feedback controllers without the need for a reliable dynamics model [4], [5], [6]. As these methods begin to be deployed in real world settings, a new theory is needed to understand the behavior of these algorithms as they are integrated into safety-critical control loops.

### Regret Bound of Adaptive Control in Linear Quadratic Gaussian (LQG) Systems

One of the core challenges in the field of control theory and reinforcement learning is adaptive control. It is the problem of controlling dynamical systems when the dynamics of the systems are unknown to the decision-making agents. In adaptive control, agents interact with given systems in order to explore and control them while the long-term objective is to minimize the overall average associated costs. The agent has to balance between exploration and exploitation, learn the dynamics, strategize for further exploration, and exploit the estimation to minimize the overall costs. The sequential nature of agent-system interaction results in challenges in the system identifying, estimation, and control under uncertainty, and these challenges are magnified when the systems are partially observable, i.e. contain hidden underlying dynamics. In the linear systems, when the underlying dynamics are fully observable, the asymptotic optimality of estimation methods has been the topic of study in the last decades [Lai et al., 1982, Lai and Wei, 1987]. Recently, novel techniques and learning algorithms have been developed to study the finite-time behavior of adaptive control algorithms and shed light on the design of optimal methods [Peña et al., 2009, Fiechter, 1997, Abbasi-Yadkori and Szepesvári, 2011]. In particular, Abbasi-Yadkori and Szepesvári [2011] proposes to use the principle of optimism in the face of uncertainty (OFU) to balance exploration and exploitation in LQR, where the state of the system is observable.

### Robust Learning-Based Control via Bootstrapped Multiplicative Noise

Despite decades of research and recent progress in adaptive control and reinforcement learning, there remains a fundamental lack of understanding in designing controllers that provide robustness to inherent non-asymptotic uncertainties arising from models estimated with finite, noisy data. We propose a robust adaptive control algorithm that explicitly incorporates such non-asymptotic uncertainties into the control design. The algorithm has three components: (1) a least-squares nominal model estimator; (2) a bootstrap resampling method that quantifies non-asymptotic variance of the nominal model estimate; and (3) a non-conventional robust control design method using an optimal linear quadratic regulator (LQR) with multiplicative noise. A key advantage of the proposed approach is that the system identification and robust control design procedures both use stochastic uncertainty representations, so that the actual inherent statistical estimation uncertainty directly aligns with the uncertainty the robust controller is being designed against. We show through numerical experiments that the proposed robust adaptive controller can significantly outperform the certainty equivalent controller on both expected regret and measures of regret risk.

### Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator

We consider adaptive control of the Linear Quadratic Regulator (LQR), where an unknown linear system is controlled subject to quadratic costs. Leveraging recent developments in the estimation of linear systems and in robust controller synthesis, we present the first provably polynomial time algorithm that achieves sub-linear regret on this problem. We further study the interplay between regret minimization and parameter estimation by proving a lower bound on the expected regret in terms of the exploration schedule used by any algorithm. Finally, we conduct a numerical study comparing our robust adaptive algorithm to other methods from the adaptive LQR literature, and demonstrate the flexibility of our proposed method by extending it to a demand forecasting problem subject to state constraints. Papers published at the Neural Information Processing Systems Conference.

### Improper Learning for Non-Stochastic Control

We consider the problem of controlling a possibly unknown linear dynamical system with adversarial perturbations, adversarially chosen convex loss functions, and partially observed states, known as non-stochastic control. We introduce a controller parametrization based on the denoised observations, and prove that applying online gradient descent to this parametrization yields a new controller which attains sublinear regret vs. a large class of closed-loop policies. In the fully-adversarial setting, our controller attains an optimal regret bound of $\sqrt{T}$-when the system is known, and, when combined with an initial stage of least-squares estimation, $T^{2/3}$ when the system is unknown; both yield the first sublinear regret for the partially observed setting. Our bounds are the first in the non-stochastic control setting that compete with \emph{all} stabilizing linear dynamical controllers, not just state feedback. Moreover, in the presence of semi-adversarial noise containing both stochastic and adversarial components, our controller attains the optimal regret bounds of $\mathrm{poly}(\log T)$ when the system is known, and $\sqrt{T}$ when unknown. To our knowledge, this gives the first end-to-end $\sqrt{T}$ regret for online Linear Quadratic Gaussian controller, and applies in a more general setting with adversarial losses and semi-adversarial noise.