Reinforcement Learning
Fast Design Space Adaptation with Deep Reinforcement Learning for Analog Circuit Sizing
Yang, Kai-En, Tsai, Chia-Yu, Shen, Hung-Hao, Chiang, Chen-Feng, Tsai, Feng-Ming, Wang, Chung-An, Ting, Yiju, Yeh, Chia-Shun, Lai, Chin-Tang
We present a novel framework for design space search in analog circuit sizing using deep reinforcement learning (DRL). Today, analog circuit design is a largely manual routine that demands heavy design effort because automation tools are lacking, which motivates developing one. Prior approaches cast this process as an optimization problem and use global search strategies based on DRL with complex network architectures. Nonetheless, these models are hard to converge and neglect the various PVT (process, voltage, temperature) working conditions. In this work, we reduce the problem to a constraint satisfaction problem, for which a local strategy is adopted. Thus, a simple feed-forward network with few layers can be used to implement a model-based reinforcement learning agent. To evaluate the value of our framework in production, we cooperate with R&D teams at an IC design company. On circuits in TSMC's advanced 5 nm and 6 nm processes, our agents deliver PPA (performance, power, area) beyond human level. Furthermore, the product will be taped out in the near future.
Logistic $Q$-Learning
Bas-Serrano, Joan, Curi, Sebastian, Krause, Andreas, Neu, Gergely
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. The method is closely related to the classic Relative Entropy Policy Search (REPS) algorithm of Peters et al. (2010), with the key difference that our method introduces a Q-function that enables efficient exact model-free implementation. The main feature of our algorithm (called Q-REPS) is [...] While REPS is elegantly derived from a principled linear-programming (LP) formulation of optimal control in MDPs, it has the serious shortcoming that its faithful implementation requires access to the true MDP for both the policy evaluation and improvement steps, even at deployment time. The usual way to address this limitation is to use an empirical approximation to the policy evaluation step and to project the policy from the improvement step into a parametric space (Deisenroth et al., 2013), losing all the theoretical guarantees of REPS in the process.
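For context, one common way to write the REPS policy improvement step is sketched below. The notation is an assumption made here for illustration and is not taken from the paper: $\eta > 0$ is a step size, $r$ is the reward function, $P$ is the transition kernel, and $V_k$ is the value function produced by the policy evaluation step.

$$\pi_{k+1}(a \mid x) \;\propto\; \pi_k(a \mid x)\, \exp\!\Big(\eta \Big[\, r(x,a) + \sum_{x'} P(x' \mid x,a)\, V_k(x') - V_k(x) \Big]\Big)$$

The sum over next states $x'$ uses $P$ explicitly, which illustrates why a faithful implementation of this step needs access to the true MDP.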
Deep Q-Network-based Adaptive Alert Threshold Selection Policy for Payment Fraud Systems in Retail Banking
Machine learning models have widely been used in fraud detection systems. Most of the research and development efforts have been concentrated on improving the performance of the fraud scoring models. Yet, the downstream fraud alert systems still have limited to no model adoption and rely on manual steps. Alert systems are pervasively used across all payment channels in retail banking and play an important role in the overall fraud detection process. Current fraud detection systems end up with large numbers of dropped alerts due to their inability to account for the alert processing capacity. Ideally, alert threshold selection enables the system to maximize the fraud detection while balancing the upstream fraud scores and the available bandwidth of the alert processing teams. However, in practice, fixed thresholds that are used for their simplicity do not have this ability. In this paper, we propose an enhanced threshold selection policy for fraud alert systems. The proposed approach formulates the threshold selection as a sequential decision making problem and uses Deep Q-Network based reinforcement learning. Experimental results show that this adaptive approach outperforms the current static solutions by reducing the fraud losses as well as improving the operational efficiency of the alert system.
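The capacity problem described above can be illustrated with a minimal sketch. It is not the Deep Q-Network policy from the paper; it only simulates, for one period, the kind of outcome an agent might observe after choosing a threshold. The function name `process_alerts`, the arrival-order review rule, and the toy data are all assumptions made for illustration.

```python
def process_alerts(transactions, threshold, capacity):
    """Simulate one period of a fraud alert queue (hypothetical model).

    transactions: list of (score, is_fraud) tuples in arrival order.
    threshold:    a transaction raises an alert when score >= threshold.
    capacity:     number of alerts the review team can process this period.

    Returns (caught_fraud, dropped_alerts, missed_fraud).
    """
    alerts = [t for t in transactions if t[0] >= threshold]
    # Assumption: alerts are reviewed in arrival order until capacity runs out.
    reviewed = alerts[:capacity]
    dropped = alerts[capacity:]
    caught = sum(1 for _, is_fraud in reviewed if is_fraud)
    total_fraud = sum(1 for _, is_fraud in transactions if is_fraud)
    return caught, len(dropped), total_fraud - caught


# Toy period: five scored transactions, two of them fraudulent.
period = [(0.2, False), (0.6, False), (0.9, True), (0.7, False), (0.95, True)]

# A low fixed threshold floods the queue and a fraudulent alert is dropped.
low = process_alerts(period, threshold=0.5, capacity=2)

# A higher threshold matches the available capacity.
high = process_alerts(period, threshold=0.8, capacity=2)
```

In this toy period the lower threshold raises four alerts for a team that can review two, so one fraudulent transaction is lost among the dropped alerts. A learned policy would pick the threshold per period from the observed scores and capacity instead of fixing it in advance.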
Improving Generalization in Reinforcement Learning with Mixture Regularization
Wang, Kaixin, Kang, Bingyi, Shao, Jie, Feng, Jiashi
Deep reinforcement learning (RL) agents trained in a limited set of environments tend to suffer from overfitting and fail to generalize to unseen testing environments. To improve their generalizability, data augmentation approaches (e.g. cutout and random convolution) have previously been explored to increase data diversity. However, we find that these approaches only locally perturb the observations regardless of the training environments, and so show limited effectiveness in enhancing data diversity and generalization performance. In this work, we introduce a simple approach, named mixreg, which trains agents on a mixture of observations from different training environments and imposes linearity constraints on the observation interpolations and the supervision (e.g. associated reward) interpolations. Mixreg increases data diversity more effectively and helps learn smoother policies. We verify its effectiveness in improving generalization by conducting extensive experiments on the large-scale Procgen benchmark. Results show that mixreg outperforms well-established baselines on unseen testing environments by a large margin. Mixreg is simple, effective and general. It can be applied to both policy-based and value-based RL algorithms. Code is available at https://github.com/kaixin96/mixreg.
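The interpolation idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (see the repository linked above for that): observations are assumed to be flat lists of floats, the helper names `mix_pair` and `mix_batch` are hypothetical, and drawing the mixing coefficient from a Beta distribution with parameter `alpha` is an assumption borrowed from mixup-style augmentation.

```python
import random


def mix_pair(obs_i, obs_j, reward_i, reward_j, lam):
    """Convexly combine two observations and their rewards with weight lam."""
    if len(obs_i) != len(obs_j):
        raise ValueError("observations must have the same length")
    mixed_obs = [lam * a + (1.0 - lam) * b for a, b in zip(obs_i, obs_j)]
    # The supervision signal is interpolated with the same coefficient.
    mixed_reward = lam * reward_i + (1.0 - lam) * reward_j
    return mixed_obs, mixed_reward


def mix_batch(observations, rewards, alpha=0.2, rng=None):
    """Pair each sample with a randomly chosen partner and mix them.

    alpha is the (assumed) Beta distribution parameter for the coefficient.
    """
    rng = rng or random.Random()
    partners = list(range(len(observations)))
    rng.shuffle(partners)
    mixed = []
    for i, j in enumerate(partners):
        lam = rng.betavariate(alpha, alpha)  # coefficient in [0, 1]
        mixed.append(
            mix_pair(observations[i], observations[j], rewards[i], rewards[j], lam)
        )
    return mixed
```

Using one coefficient for both the observation and the reward is what ties the two interpolations together; a training loop would then feed the mixed pairs to the agent in place of the raw ones.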
Negotiating Team Formation Using Deep Reinforcement Learning
Bachrach, Yoram, Everett, Richard, Hughes, Edward, Lazaridou, Angeliki, Leibo, Joel Z., Lanctot, Marc, Johanson, Michael, Czarnecki, Wojciech M., Graepel, Thore
When autonomous agents interact in the same environment, they must often cooperate to achieve their goals. One way for agents to cooperate effectively is to form a team, make a binding agreement on a joint plan, and execute it. However, when agents are self-interested, the gains from team formation must be allocated appropriately to incentivize agreement. Various approaches for multi-agent negotiation have been proposed, but typically only work for particular negotiation protocols. More general methods usually require human input or domain-specific data, and so do not scale. To address this, we propose a framework for training agents to negotiate and form teams using deep reinforcement learning. Importantly, our method makes no assumptions about the specific negotiation protocol, and is instead completely experience driven. We evaluate our approach on both non-spatial and spatially extended team-formation negotiation environments, demonstrating that our agents beat hand-crafted bots and reach negotiation outcomes consistent with fair solutions predicted by cooperative game theory. Additionally, we investigate how the physical location of agents influences negotiation outcomes.
Multi-Radar Tracking Optimization for Collaborative Combat
Nour, Nouredine, Belhaj-Soullami, Reda, Buron, Cédric, Peres, Alain, Barbaresco, Frédéric
Despite great interest in recent research, in particular in China [1, 2], micromanagement of sensors by centralized command and control drives possible inefficiencies and risk into operations. Tactical decision making and execution by headquarters usually fail to achieve the speed necessary to meet rapid changes. Collaborative radars with C2 must provide decision superiority despite the attempts of an adversary to disrupt OODA cycles at all levels of operations. Artificial intelligence can contribute to the coordinated conduct of the action by improving the response time to threats and optimizing the allocation and distribution of tasks within elementary smart radars. To address this problem, Thales and the private research lab NukkAI have been collaborating to introduce novel approaches for netted radars. Thales provided the simulation modeling the multi-radar target allocation problem, and NukkAI proposed two novel reward-based learning approaches for the problem. In this paper, we present these two approaches: Evolutionary Single-Target Ordering (ESTO), which is based on evolution strategies, and an RL approach based on Actor-Critic methods. To make the RL method tractable in practice, we introduce a simplification of the problem that we prove to be equivalent to solving the initial formulation. We evaluate our solutions on diverse scenarios of the aforementioned simulation.
Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification
Mankowitz, Daniel J., Calian, Dan A., Jeong, Rae, Paduraru, Cosmin, Heess, Nicolas, Dathathri, Sumanth, Riedmiller, Martin, Mann, Timothy
Many real-world physical control systems are required to satisfy constraints upon deployment. Furthermore, real-world systems are often subject to effects such as non-stationarity, wear-and-tear, uncalibrated sensors and so on. Such effects effectively perturb the system dynamics and can cause a policy trained successfully in one domain to perform poorly when deployed to a perturbed version of the same domain. This can affect a policy's ability to maximize future rewards as well as the extent to which it satisfies constraints. We refer to this as constrained model misspecification. We present an algorithm with theoretical guarantees that mitigates this form of misspecification, and showcase its performance in multiple Mujoco tasks from the Real World Reinforcement Learning (RWRL) suite.
Runtime Safety Assurance Using Reinforcement Learning
Lazarus, Christopher, Lopez, James G., Kochenderfer, Mykel J.
The airworthiness and safety of a non-pedigreed autopilot must be verified, but the cost to formally do so can be prohibitive. We can bypass formal verification of non-pedigreed components by incorporating Runtime Safety Assurance (RTSA) as a mechanism to ensure safety. RTSA consists of a meta-controller that observes the inputs and outputs of a non-pedigreed component and verifies formally specified behavior as the system operates. When the system is triggered, a verified recovery controller is deployed. Recovery controllers are designed to be safe but are very likely disruptive to the operational objective of the system, and thus RTSA systems must balance safety and efficiency. The objective of this paper is to design a meta-controller capable of identifying unsafe situations with high accuracy. The high-dimensional and non-linear dynamics in which modern controllers are deployed, along with the black-box nature of the nominal controllers, make this a difficult problem. Current approaches rely heavily on domain expertise and human engineering. We frame the design of RTSA with the Markov decision process (MDP) framework and use reinforcement learning (RL) to solve it. Our learned meta-controller consistently exhibits superior performance in our experiments compared to our baseline, human-engineered approach.
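The switching structure described above can be sketched as a thin wrapper around two controllers. This is only an illustration of the architecture, not the paper's learned meta-controller: the class name `RuntimeSafetyAssurance`, the callable interfaces, the latching behaviour after a trigger, and the hand-written altitude check are all assumptions. In the paper's setting the `is_unsafe` decision is what reinforcement learning would learn.

```python
class RuntimeSafetyAssurance:
    """Wrap a nominal controller with a monitor and a recovery controller."""

    def __init__(self, nominal, recovery, is_unsafe):
        self.nominal = nominal      # unverified (non-pedigreed) controller
        self.recovery = recovery    # verified fallback controller
        self.is_unsafe = is_unsafe  # meta-controller decision: (state, command) -> bool
        self.triggered = False

    def act(self, state):
        proposed = self.nominal(state)
        # The monitor sees both the input (state) and the output (command).
        if not self.triggered and self.is_unsafe(state, proposed):
            # Assumption: once triggered, recovery stays in control.
            self.triggered = True
        return self.recovery(state) if self.triggered else proposed


# Toy example: the state is an altitude and the command is a climb rate.
rtsa = RuntimeSafetyAssurance(
    nominal=lambda altitude: -5.0,   # always descends
    recovery=lambda altitude: 5.0,   # always climbs
    is_unsafe=lambda altitude, command: altitude + command < 10.0,
)
commands = [rtsa.act(altitude) for altitude in (30.0, 12.0, 30.0)]
```

Because every trigger hands control to a recovery behaviour that disrupts the mission, a monitor that fires too early costs efficiency while one that fires too late costs safety, which is the trade-off the abstract refers to.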
Reinforcement Learning for Optimization of COVID-19 Mitigation policies
Kompella, Varun, Capobianco, Roberto, Jong, Stacy, Browne, Jonathan, Fox, Spencer, Meyers, Lauren, Wurman, Peter, Stone, Peter
The year 2020 has seen the COVID-19 virus lead to one of the worst global pandemics in history. As a result, governments around the world are faced with the challenge of protecting public health while keeping the economy running to the greatest extent possible. Epidemiological models provide insight into the spread of these types of diseases and predict the effects of possible intervention policies. However, to date, even the most data-driven intervention policies rely on heuristics. In this paper, we study how reinforcement learning (RL) can be used to optimize mitigation policies that minimize the economic impact without overwhelming the hospital capacity. Our main contributions are (1) a novel agent-based pandemic simulator which, unlike traditional models, is able to model fine-grained interactions among people at specific locations in a community; and (2) an RL-based methodology for optimizing fine-grained mitigation policies within this simulator. Our results validate both the overall simulator behavior and the learned policies under realistic conditions.