Reinforcement Learning
Naive Exploration is Optimal for Online LQR
Simchowitz, Max, Foster, Dylan J.
We consider the problem of online adaptive control of the linear quadratic regulator, where the true system parameters are unknown. We prove new upper and lower bounds demonstrating that the optimal regret scales as $\widetilde{\Theta}({\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T}})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state. Notably, our lower bounds rule out the possibility of a $\mathrm{poly}(\log{}T)$-regret algorithm, which has been conjectured due to the apparent strong convexity of the problem. Our upper bounds are attained by a simple variant of \emph{certainty equivalence control}, where the learner selects control inputs according to the optimal controller for their estimate of the system while injecting exploratory random noise. While this approach was shown to achieve $\sqrt{T}$-regret by Mania et al. 2019, we show that if the learner continually refines their estimates of the system matrices, the method attains optimal dimension dependence as well. Central to our upper and lower bounds is a new approach for controlling perturbations of Ricatti equations, which we call the \emph{self-bounding ODE method}. The approach enables regret upper bounds which hold for \emph{any stabilizable instance}, require no foreknowledge of the system except for a single stabilizing controller, and scale with natural control-theoretic quantities.
Tractable Reinforcement Learning of Signal Temporal Logic Objectives
Venkataraman, Harish, Aksaray, Derya, Seiler, Peter
Signal temporal logic (STL) is an expressive language to specify time-bound real-world robotic tasks and safety specifications. Recently, there has been an interest in learning optimal policies to satisfy STL specifications via reinforcement learning (RL). Learning to satisfy STL specifications often needs a sufficient length of state history to compute reward and the next action. The need for history results in exponential state-space growth for the learning problem. Thus the learning problem becomes computationally intractable for most real-world applications. In this paper, we propose a compact means to capture state history in a new augmented state-space representation. An approximation to the objective (maximizing probability of satisfaction) is proposed and solved for in the new augmented state-space. We show the performance bound of the approximate solution and compare it with the solution of an existing technique via simulations.
Facebook AI gives maps the brushoff in helping robots find the way
Facebook has scored an impressive feat involving AI that can navigate without any map. Facebook's wish for bragging rights, although they said they have a way to go, were evident in its blog post, "Near-perfect point-goal navigation from 2.5 billion frames of experience." Long story short, Facebook has delivered an algorithm that, quoting MIT Technology Review, lets robots find the shortest route in unfamiliar environments, opening the door to robots that can work inside homes and offices." And, in line with the plain-and-simple, Ubergizmo's Tyler Lee also remarked: "Facebook believes that with this new algorithm, it will be capable of creating robots that can navigate an area without the need for maps...in theory, you could place a robot in a room or an area without a map and it should be able to find its way to its destination." Erik Wijmans and Abhishek Kadian in the Facebook Jan. 21 post said that, well, after all, one of the technology key challenges is "teaching these systems to navigate through complex, unfamiliar real-world environments to reach a specified destination--without a preprovided map." Facebook has taken on the challenge. The two announced that Facebook AI created a large-scale distributed reinforcement learning algorithm called DD-PPO, "which has effectively solved the task of point-goal navigation using only an RGB-D camera, GPS, and compass data," they wrote. DD-PPO stands for decentralized distributed proximal policy optimization. This is what Facebook is using to train agents and results seen in virtual environments such as houses and office buildings were encouraging. The bloggers pointed out that "even failing 1 out of 100 times is not acceptable in the physical world, where a robot agent might damage itself or its surroundings by making an error." Beyond DD-PPO, the authors gave credit to Facebook AI's open source AI Habitat platform for its "state-of-the-art speed and fidelity." AI Habitat made its open source announcement last year as a simulation platform to train embodied agents such as virtual robots in photo-realistic 3-D environments. Facebook said it was part of "Facebook AI's ongoing effort to create systems that are less reliant on large annotated data sets used for supervised training." InfoQ had said in July that "The technology was taking a different approach than relying upon static data sets which other researchers have traditionally used and that Facebook decided to open-source this technology to move this subfield forward." Jon Fingas in Engadget looked at how the team worked toward AI navigation (and this is where that 25 billion number comes in). "Previous projects tend to struggle without massive computational power.
RLgraph: Modular Computation Graphs for Deep Reinforcement Learning
Reinforcement learning (RL) tasks are challenging to implement, execute and test due to algorithmic instability, hyper-parameter sensitivity, and heterogeneous distributed communication patterns. We argue for the separation of logical component composition, backend graph definition, and distributed execution. To this end, we introduce RLgraph, a library for designing and executing reinforcement learning tasks in both static graph and define-by-run paradigms. The resulting implementations are robust, incrementally testable, and yield high performance across different deep learning frameworks and distributed backends.
Constrained Upper Confidence Reinforcement Learning
Zheng, Liyuan, Ratliff, Lillian J.
Constrained Markov Decision Processes are a class of stochastic decision problems in which the decision maker must select a policy that satisfies auxiliary cost constraints. This paper extends upper confidence reinforcement learning for settings in which the reward function and the constraints, described by cost functions, are unknown a priori but the transition kernel is known. Such a setting is well-motivated by a number of applications including exploration of unknown, potentially unsafe, environments. We present an algorithm C-UCRL and show that it achieves sub-linear regret ($ O(T^{\frac{3}{4}}\sqrt{\log(T/\delta)})$) with respect to the reward while satisfying the constraints even while learning with probability $1-\delta$. Illustrative examples are provided.
Silly rules improve the capacity of agents to learn stable enforcement and compliance behaviors
Kรถster, Raphael, Hadfield-Menell, Dylan, Hadfield, Gillian K., Leibo, Joel Z.
How can societies learn to enforce and comply with social norms? Here we investigate the learning dynamics and emergence of compliance and enforcement of social norms in a foraging game, implemented in a multi-agent reinforcement learning setting. In this spatiotemporally extended game, individuals are incentivized to implement complex berry-foraging policies and punish transgressions against social taboos covering specific berry types. We show that agents benefit when eating poisonous berries is taboo, meaning the behavior is punished by other agents, as this helps overcome a credit-assignment problem in discovering delayed health effects. Critically, however, we also show that introducing an additional taboo, which results in punishment for eating a harmless berry, improves the rate and stability with which agents learn to punish taboo violations and comply with taboos. Counterintuitively, our results show that an arbitrary taboo (a "silly rule") can enhance social learning dynamics and achieve better outcomes in the middle stages of learning. We discuss the results in the context of studying normativity as a group-level emergent phenomenon.
Complete Intelligence Superforecasting Streamlined
The Complete Intelligence Global Cognitive System (GCS) was developed using Basic AI in 2015, and subsequently moved to a ML environment in 2016. In 2018, we expanded our analytic processes to harness the power of Deep Learning. At present, we are moving forward into the area of Reinforcement Learning to further improve our predictive efficiency. We begin our analytics with one of the world's largest global trade models that looks at more than 1,400 different industries and over 100 reporting countries. This is combined with thousands of commodities, equity indices, currencies and economic indicators to create a comprehensive model.
Stacked Auto Encoder Based Deep Reinforcement Learning for Online Resource Scheduling in Large-Scale MEC Networks
Jiang, Feibo, Wang, Kezhi, Dong, Li, Pan, Cunhua, Yang, Kun
An online resource scheduling framework is proposed for minimizing the sum of weighted task latency for all the mobile users, by optimizing offloading decision, transmission power, and resource allocation in the mobile edge computing (MEC) system. Towards this end, a deep reinforcement learning (DRL) method is proposed to obtain an online resource scheduling policy. Firstly, a related and regularized stacked auto encoder (2r-SAE) with unsupervised learning is proposed to perform data compression and representation for high dimensional channel quality information (CQI) data, which can reduce the state space for DRL. Secondly, we present an adaptive simulated annealing based approach (ASA) as the action search method of DRL, in which an adaptive h-mutation is used to guide the search direction and an adaptive iteration is proposed to enhance the search efficiency during the DRL process. Thirdly, a preserved and prioritized experience replay (2p-ER) is introduced to assist the DRL to train the policy network and find the optimal offloading policy. Numerical results are provided to demonstrate that the proposed algorithm can achieve near-optimal performance while significantly decreasing the computational time compared with existing benchmarks. It also shows that the proposed framework is suitable for resource scheduling problem in large-scale MEC networks, especially in the dynamic environment.
PCGRL: Procedural Content Generation via Reinforcement Learning
Khalifa, Ahmed, Bontrager, Philip, Earle, Sam, Togelius, Julian
We investigate how reinforcement learning can be used to train level-designing agents. This represents a new approach to procedural content generation in games, where level design is framed as a game, and the content generator itself is learned. By seeing the design problem as a sequential task, we can use reinforcement learning to learn how to take the next action so that the expected final level quality is maximized. This approach can be used when few or no examples exist to train from, and the trained generator is very fast. We investigate three different ways of transforming two-dimensional level design problems into Markov decision processes and apply these to three game environments.