Reinforcement learning algorithms are known to be sample inefficient, and often performance on one task can be substantially improved by leveraging information (e.g., via pre-training) on other related tasks. In this work, we propose a technique to achieve such knowledge transfer in cases where agent trajectories contain sensitive or private information, such as in the healthcare domain. Our approach leverages a differentially private policy evaluation algorithm to initialize an actor-critic model and improve the effectiveness of learning in downstream tasks. We empirically show this technique increases sample efficiency in resource-constrained control problems while preserving the privacy of trajectories collected in an upstream task.
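To make the idea of a privacy-preserving critic initialization concrete, here is a minimal sketch. It is not the authors' algorithm: it swaps in a simple tabular first-visit Monte Carlo evaluator with Laplace output perturbation. The function names, the tabular state space, the clipping bound `g_max`, the length cap `horizon` (assumed to be at least 1), and the even split of the privacy budget `epsilon` are all assumptions made for illustration.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_first_visit_values(trajectories, n_states, gamma, g_max, horizon, epsilon, rng):
    """Noisy first-visit Monte Carlo value estimates for states 0..n_states-1.

    Each trajectory is a list of (state, reward) pairs. Hypothetical helper,
    not the evaluation algorithm used in the paper.
    """
    sums = [0.0] * n_states
    counts = [0.0] * n_states
    for traj in trajectories:
        traj = traj[:horizon]  # cap length so one trajectory has bounded influence
        returns, g = [0.0] * len(traj), 0.0
        for t in range(len(traj) - 1, -1, -1):  # discounted return-to-go
            g = traj[t][1] + gamma * g
            returns[t] = g
        seen = set()
        for t, (state, _) in enumerate(traj):
            if state in seen:
                continue  # first visit only: one contribution per state
            seen.add(state)
            sums[state] += min(max(returns[t], 0.0), g_max)  # clip to [0, g_max]
            counts[state] += 1.0

    # Adding or removing one trajectory changes at most k entries:
    # each sum by at most g_max and each count by at most 1.
    k = min(horizon, n_states)
    half_eps = epsilon / 2.0  # split the budget between sums and counts
    values = []
    for s in range(n_states):
        noisy_sum = sums[s] + laplace(k * g_max / half_eps, rng)
        noisy_count = counts[s] + laplace(k / half_eps, rng)
        values.append(noisy_sum / max(noisy_count, 1.0))  # post-processing only
    return values


# Example: private value estimates that could seed a downstream critic.
upstream = [[(0, 1.0), (1, 1.0)], [(0, 0.0)]]
critic_init = dp_first_visit_values(
    upstream, n_states=3, gamma=1.0, g_max=10.0, horizon=5,
    epsilon=1.0, rng=random.Random(0),
)
```

Only the noisy estimates leave the upstream task, so a downstream actor-critic learner initialized from `critic_init` never touches the raw trajectories. With so few trajectories the noise dominates; the sketch is meant to show the mechanism, not a useful privacy-utility trade-off.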
At a 2017 O'Reilly AI conference, Andrew Ng ranked reinforcement learning dead last in terms of its utility for business applications. Compared to other machine learning methods like supervised learning, transfer learning, and even unsupervised learning, deep reinforcement learning (deep RL) is incredibly data hungry, often unstable, and rarely the best option in terms of performance. RL has historically been applied successfully only in arenas where mountains of simulated data can be generated on demand, such as games and robotics. Despite RL's limitations in solving business use cases, some AI experts believe this approach is the most viable strategy for achieving human-level or superhuman Artificial General Intelligence (AGI). The recent victory of DeepMind's AlphaStar over top-ranked professional StarCraft players suggests we might be on the cusp of applying deep RL to real-world problems with real-time demands, extraordinary complexity, and incomplete information.
Deep reinforcement learning (DRL) has achieved a great deal since it was first proposed. Generally, DRL agents receive high-dimensional inputs at each step and select actions according to policies based on deep neural networks. This learning mechanism updates the policy end to end to maximize the return. In this paper, we survey the progress of DRL methods, including value-based, policy gradient, and model-based algorithms, and compare their main techniques and properties. DRL also plays an important role in game artificial intelligence (AI). We review the achievements of DRL in various video games, including classic arcade games, first-person perspective games, and multi-agent real-time strategy games, from 2D to 3D and from single-agent to multi-agent. A large number of video game AIs with DRL have achieved super-human performance, but challenges remain in this domain. Therefore, we also discuss some key points when applying DRL methods to this field, including exploration-exploitation, sample efficiency, generalization and transfer, multi-agent learning, imperfect information, and delayed sparse rewards, as well as some research directions.
As part of its effort to find better ways to develop and train "safe artificial general intelligence," OpenAI has been releasing its own versions of reinforcement learning algorithms. The first is a baseline implementation called Actor Critic using Kronecker-factored Trust Region (ACKTR). Developed by researchers from the University of Toronto (UofT) and New York University (NYU), ACKTR improves how policies are trained with deep reinforcement learning -- learning that is accomplished only by trial and error, and obtained only through raw observation. In a paper published online, the UofT and NYU researchers used simulated robots and Atari games to test how ACKTR learns control policies. "For machine learning algorithms, two costs are important to consider: sample complexity and computational complexity," according to an OpenAI Research blog post.
In TrojDRL: Trojan Attacks on Deep Reinforcement Learning Agents, a group of Boston University researchers demonstrates an attack on machine learning systems trained with "reinforcement learning," in which ML systems derive solutions to complex problems by iteratively trying multiple solutions. The attack is related to adversarial examples, a class of attacks that involve probing a machine-learning model to find "blind spots" -- very small changes (usually imperceptible to humans) that cause a machine learning classifier's accuracy to fall off rapidly (for example, a small change to a model of a gun can make an otherwise reliable classifier think it's looking at a helicopter). It's not clear whether it's possible to create a machine learning model that's immune to adversarial examples (the expert I trust most on this told me off the record that they think it's not), but what the researchers behind TrojDRL propose is a method for deliberately introducing adversarial examples by slipping difficult-to-spot changes into training data, which will produce defects in the eventual model that can serve as a "backdoor" that future adversaries can exploit. Training data sets are often ad hoc in nature; they're so large that it's hard to create version-by-version snapshots, and they're also so prone to mislabeling that researchers are always making changes to them in order to improve their accuracy. All of this suggests that poisoning training data might be easier than it sounds.
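To see why a handful of altered training samples is hard to spot, here is a toy sketch of trigger-based data poisoning. It is not the TrojDRL procedure: the sample layout (a dict with a 2D `obs` grid and an integer `action`), the function name `poison_samples`, and all parameters are hypothetical, chosen only to illustrate the general idea of stamping a small pattern onto a fraction of the data and changing what the model is taught to do when it sees that pattern.

```python
import copy
import random


def poison_samples(samples, rate, target_action, trigger_value, patch, rng):
    """Return (poisoned_copy, chosen_indices); the input list is left untouched.

    Each sample is assumed to be {"obs": 2D list of floats, "action": int}.
    """
    poisoned = copy.deepcopy(samples)
    n_poison = int(round(rate * len(poisoned)))
    chosen = sorted(rng.sample(range(len(poisoned)), n_poison))
    for i in chosen:
        obs = poisoned[i]["obs"]
        # Stamp a small square patch into the top-left corner of the observation.
        for r in range(min(patch, len(obs))):
            for c in range(min(patch, len(obs[r]))):
                obs[r][c] = trigger_value
        # Relabel so the trigger becomes associated with the attacker's action.
        poisoned[i]["action"] = target_action
    return poisoned, chosen


# Example: poison 20% of ten blank 4x4 observations.
clean = [{"obs": [[0.0] * 4 for _ in range(4)], "action": 0} for _ in range(10)]
poisoned, chosen = poison_samples(
    clean, rate=0.2, target_action=3, trigger_value=1.0, patch=2,
    rng=random.Random(1),
)
```

In this toy setup only a 2x2 corner of two samples changes, which is the kind of edit that is easy to miss in a large data set that is never snapshotted version by version.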