Trajectory optimizers for model-based reinforcement learning, such as the Cross-Entropy Method (CEM), can yield compelling results even in high-dimensional control tasks and sparse-reward environments. However, their sampling inefficiency prevents them from being used for real-time planning and control. We propose an improved version of the CEM algorithm for fast planning, with novel additions including temporally-correlated actions and memory, requiring 2.7-22x fewer samples and yielding a performance increase of 1.2-10x in high-dimensional control problems.
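To make the starting point concrete, the vanilla CEM planner the abstract improves upon can be sketched in a few lines: sample action sequences from a Gaussian, keep the lowest-cost "elite" fraction, and refit the Gaussian to the elites. All names, defaults, and the toy cost below are illustrative; the paper's additions (temporally-correlated action noise and memory across planning steps) are not shown.

```python
import numpy as np

def cem_plan(cost_fn, horizon=10, act_dim=2, n_samples=200, n_elite=20, n_iters=5):
    # Minimal vanilla-CEM sketch: iteratively refit a diagonal Gaussian over
    # action sequences to the elite (lowest-cost) samples.
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        noise = np.random.randn(n_samples, horizon, act_dim)
        samples = mean + std * noise          # broadcast over the sample axis
        costs = np.array([cost_fn(s) for s in samples])
        elite = samples[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean  # the planned open-loop action sequence

# toy usage: drive every action toward a target value of 0.5
np.random.seed(0)
plan = cem_plan(lambda s: float(np.sum((s - 0.5) ** 2)))
```

In a model-based RL loop, `cost_fn` would roll the sampled actions through a learned dynamics model; the sample counts above are exactly the budget the paper's modifications aim to shrink.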
The problem of recovering a signal from magnitude-only measurements, also known as the phase retrieval problem, is a fundamental challenge in nano-, bio-, and astronomical imaging systems, as well as in speech processing. The problem is ill-posed, and therefore additional assumptions on the signal and/or the measurements are necessary. In this paper, we first study the case where the underlying signal x is s-sparse. We develop a novel recovery algorithm that we call Compressive Phase Retrieval with Alternating Minimization, or CoPRAM. Our algorithm is simple and can be obtained via a natural combination of the classical alternating minimization approach for phase retrieval with the CoSaMP algorithm for sparse recovery.
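The alternating structure the abstract describes can be sketched as follows. This is a simplified stand-in, not CoPRAM itself: the sparse update here is plain hard thresholding rather than a CoSaMP step, and the initialization is a crude placeholder rather than the spectral initialization such methods typically use.

```python
import numpy as np

def sparse_altmin_pr(A, y, s, n_iters=50):
    # Simplified sketch of alternating minimization for sparse phase
    # retrieval: alternate between (i) estimating the measurement signs
    # given the current estimate x and (ii) a least-squares update followed
    # by hard thresholding to the s largest entries. The thresholding is a
    # crude stand-in for CoPRAM's CoSaMP-based update.
    n = A.shape[1]
    x = np.zeros(n)
    x[0] = 1.0  # placeholder init (not the spectral init used in practice)
    for _ in range(n_iters):
        p = np.sign(A @ x)                              # sign (phase) estimate
        z, *_ = np.linalg.lstsq(A, p * y, rcond=None)   # least-squares signal fit
        x = np.zeros(n)
        top = np.argsort(np.abs(z))[-s:]                # keep s largest entries
        x[top] = z[top]
    return x

# toy usage: magnitude-only Gaussian measurements of a 3-sparse signal
rng = np.random.default_rng(0)
A = rng.standard_normal((120, 20))
x_true = np.zeros(20)
x_true[[2, 7, 11]] = [1.0, -2.0, 1.5]
y = np.abs(A @ x_true)
x_hat = sparse_altmin_pr(A, y, s=3)
```

Note that any recovery is only up to a global sign flip, since y = |Ax| is invariant under x -> -x.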
We study the sampling-based planning problem in Markov decision processes (MDPs) accessible only through a generative model, a setting usually referred to as Monte-Carlo planning. Our objective is to return a good estimate of the optimal value function at any state while minimizing the number of calls to the generative model, i.e., the sample complexity. We propose a new algorithm, TrailBlazer, able to handle MDPs with a finite or an infinite number of transitions from a state-action pair to next states. TrailBlazer is an adaptive algorithm that exploits possible structures of the MDP by exploring only a subset of states reachable by following near-optimal policies. We provide bounds on its sample complexity that depend on a measure of the quantity of near-optimal states.
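The generative-model setting can be illustrated with a naive fixed-budget planner in the spirit of sparse sampling. This sketch shows only the setting, not TrailBlazer itself, which instead adapts its sampling toward near-optimal regions of the MDP; the `model` interface and toy MDP below are assumptions for illustration.

```python
def estimate_v(model, state, depth, gamma=0.9, c=4, actions=(0, 1)):
    # Naive fixed-budget value estimate in the generative-model setting:
    # `model(s, a)` returns one sampled (reward, next_state) pair. Each
    # (state, action) pair gets exactly c generative-model calls, so the
    # budget is uniform rather than adaptive as in TrailBlazer.
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(c):  # c generative-model calls per (state, action)
            r, s2 = model(state, a)
            total += r + gamma * estimate_v(model, s2, depth - 1, gamma, c, actions)
        best = max(best, total / c)
    return best

# toy usage: a deterministic one-state MDP where action 1 always pays reward 1
def toy_model(s, a):
    return (1.0 if a == 1 else 0.0, s)

v = estimate_v(toy_model, 0, depth=3)
```

On the toy MDP this returns the truncated optimal value 1 + 0.9 + 0.81 = 2.71; the point of an adaptive algorithm is to avoid spending the full uniform budget on clearly suboptimal branches.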
We study the computational tractability of PAC reinforcement learning with rich observations. We present new provably sample-efficient algorithms for environments with deterministic hidden state dynamics and stochastic rich observations. These methods operate in an oracle model of computation -- accessing policy and value function classes exclusively through standard optimization primitives -- and therefore represent computationally efficient alternatives to prior algorithms that require enumeration. With stochastic hidden state dynamics, we prove that the only known sample-efficient algorithm, OLIVE, cannot be implemented in the oracle model. We also present several examples that illustrate fundamental challenges of tractable PAC reinforcement learning in such general settings.
Catastrophic forgetting, whereby a neural network abruptly loses previously acquired knowledge when trained on new data, is a key challenge in the quest for building intelligent agents that can learn continually when deployed in the real world, where their experiences are not necessarily i.i.d. and their resources may be limited. In my PhD, I have studied catastrophic forgetting in the context of deep reinforcement learning, where changes to the distribution of an agent's experiences arise from multiple sources and occur unpredictably over the course of learning. Inspired partially by the processes of synaptic consolidation and systems consolidation in the brain, I will present two methods that harness multi-timescale processes to mitigate catastrophic forgetting in an RL setting. Bio: Christos is currently pursuing a PhD on the topic of Continual Reinforcement Learning at Imperial College London, co-supervised by Claudia Clopath (Bioengineering) and Murray Shanahan (Computing). He graduated with a BA in Applied Mathematics from Harvard and worked as a trader at Brevan Howard for several years, before leaving to pursue MScs in Computing and Informatics at Imperial College and Edinburgh University respectively, driven by an interest in computational neuroscience and machine learning. In April, he will start a job as a Research Scientist at DeepMind.