Reinforcement Learning
Active Domain Randomization
Mehta, Bhairav, Diaz, Manfred, Golemo, Florian, Pal, Christopher J., Paull, Liam
Domain randomization is a popular technique for improving domain transfer, often used in a zero-shot setting when the target domain is unknown or cannot easily be used for training. In this work, we empirically examine the effects of domain randomization on agent generalization. Our experiments show that domain randomization may lead to suboptimal, high-variance policies, which we attribute to the uniform sampling of environment parameters. We propose Active Domain Randomization, a novel algorithm that learns a parameter sampling strategy. Our method looks for the most informative environment variations within the given randomization ranges by leveraging the discrepancies of policy rollouts in randomized and reference environment instances. We find that training more frequently on these instances leads to better overall agent generalization. In addition, when domain randomization and policy transfer fail, Active Domain Randomization offers more insight into the deficiencies of both the chosen parameter ranges and the learned policy, allowing for more focused debugging. Our experiments across various physics-based simulated tasks and a real-robot task show that this enhancement leads to more robust, consistent policies.
Bounded rational decision-making from elementary computations that reduce uncertainty
Gottwald, Sebastian, Braun, Daniel A.
In its most basic form, decision-making can be viewed as a computational process that progressively eliminates alternatives, thereby reducing uncertainty. Such processes are generally costly, meaning that the amount of uncertainty that can be reduced is limited by the amount of available computational resources. Here, we introduce the notion of elementary computation based on a fundamental principle for probability transfers that reduce uncertainty. Elementary computations can be considered as the inverse of Pigou-Dalton transfers applied to probability distributions, closely related to the concepts of majorization, T-transforms, and generalized entropies that induce a preorder on the space of probability distributions. As a consequence we can define resource cost functions that are order-preserving and therefore monotonic with respect to the uncertainty reduction. This leads to a comprehensive notion of decision-making processes with limited resources. Along the way, we prove several new results on majorization theory, as well as on entropy and divergence measures.
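As a rough illustration of the ordering described above, the following sketch moves probability mass from a less likely outcome to a more likely one (the inverse of a Pigou-Dalton transfer) and checks that the result majorizes the original distribution and has lower Shannon entropy. The function names and the example distribution are hypothetical and are not taken from the paper; Shannon entropy stands in here for the generalized entropies the abstract mentions.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def majorizes(p, q, tol=1e-12):
    """Return True if p majorizes q.

    Both inputs are assumed to be probability vectors of equal length.
    p majorizes q when every partial sum of p, sorted in decreasing order,
    is at least the matching partial sum of q.
    """
    cum_p = cum_q = 0.0
    for a, b in zip(sorted(p, reverse=True), sorted(q, reverse=True)):
        cum_p += a
        cum_q += b
        if cum_p < cum_q - tol:
            return False
    return True


def inverse_pigou_dalton(p, i, j, eps):
    """Move eps of mass from the less likely of outcomes i, j to the more likely.

    Hypothetical helper: a Pigou-Dalton transfer would move mass the other
    way (from the more likely outcome to the less likely one).
    """
    lo, hi = (i, j) if p[i] <= p[j] else (j, i)
    if not 0 <= eps <= p[lo]:
        raise ValueError("eps must lie between 0 and the smaller probability")
    q = list(p)
    q[lo] -= eps
    q[hi] += eps
    return q


p = [0.5, 0.3, 0.2]
q = inverse_pigou_dalton(p, 1, 2, 0.1)  # concentrates mass on outcome 1
print(majorizes(q, p), shannon_entropy(q) < shannon_entropy(p))
```

Because Shannon entropy is Schur-concave, any distribution that majorizes another has entropy no larger than it, which is the monotonicity that an order-preserving cost function is meant to respect.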
"Jam Me If You Can'': Defeating Jammer with Deep Dueling Neural Network Architecture and Ambient Backscattering Augmented Communications
Van Huynh, Nguyen, Nguyen, Diep N., Hoang, Dinh Thai, Dutkiewicz, Eryk
With conventional anti-jamming solutions like frequency hopping or spread spectrum, legitimate transceivers often tend to "escape" or "hide" themselves from jammers. These reactive anti-jamming approaches are constrained by the lack of timely knowledge of jamming attacks. Bringing together the latest advances in neural network architectures and ambient backscattering communications, this work allows wireless nodes to effectively "face" the jammer by first learning its jamming strategy, then adapting the rate or transmitting information right on the jamming signal. Specifically, to deal with unknown jamming attacks, existing work often relies on reinforcement learning algorithms, e.g., Q-learning. However, the Q-learning algorithm is notorious for its slow convergence to the optimal policy, especially when the system state and action spaces are large. This makes the Q-learning algorithm practically inapplicable. To overcome this problem, we design a novel deep reinforcement learning algorithm using the recent dueling neural network architecture. Our proposed algorithm allows the transmitter to effectively learn about the jammer and attain the optimal countermeasures a thousand times faster than the conventional Q-learning algorithm. Through extensive simulation results, we show that our design (using ambient backscattering and the deep dueling neural network architecture) can improve the average throughput by up to 426% and reduce the packet loss by 24%. By augmenting the ambient backscattering capability on devices and using our algorithm, it is interesting to observe that the (successful) transmission rate increases with the jamming power. Our proposed solution can find its applications in both civil (e.g., ultra-reliable and low-latency communications or URLLC) and military scenarios (to combat both inadvertent and deliberate jamming).
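For readers unfamiliar with the dueling architecture mentioned above, the sketch below shows only its standard aggregation step, in which a state-value estimate and per-action advantage estimates are combined into Q-values by subtracting the mean advantage. The numbers and function names are hypothetical; the abstract does not specify the paper's network, state space, or action set, so this is not the authors' implementation.

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    Subtracting the mean advantage makes the value/advantage split
    identifiable; this is the usual dueling aggregation rule.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def greedy_action(q_values):
    """Index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


# Hypothetical outputs of the two network streams for one state.
q = dueling_q(1.0, [1.0, 2.0, 3.0])
print(q, greedy_action(q))
```

In a full agent the two inputs would come from separate heads of one network that share earlier layers; only the combination rule is shown here.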
Creating Pro-Level AI for Real-Time Fighting Game with Deep Reinforcement Learning
Oh, Inseok, Rho, Seungeun, Moon, Sangbin, Son, Seongho, Lee, Hyoil, Chung, Jinyun
Reinforcement learning combined with deep neural networks has recently performed remarkably well in many game genres. It surpassed human-level performance in fixed game environments and turn-based two-player board games. However, to the best of our knowledge, no research has shown a result that surpasses human level in modern complex fighting games. This is due to the inherent difficulties of modern fighting games, including vast action spaces, real-time constraints, and the performance generalization required against various opponents. We overcame these challenges and made 1v1 battle AI agents for the commercial game "Blade & Soul". The trained agents competed against five professional gamers and achieved a 62% win rate. This paper presents a practical reinforcement learning method including a novel self-play curriculum and data skipping techniques. Through the curriculum, three different styles of agents are created by reward shaping and are trained against each other for robust performance. Additionally, this paper suggests data skipping techniques that increased data efficiency and facilitated exploration in vast spaces.
Amir Barati Farimani: Creative Robots with Deep Reinforcement Learning CMU RI Seminar
Recent advances in Deep Reinforcement Learning (DRL) algorithms provided us with the possibility of adding intelligence to robots. Recently, we have been applying a variety of DRL algorithms to tasks that modern control theory may not be able to solve. We observed intriguing creativity from robots when they are constrained in reaching a certain goal. To introduce the topic, I will talk about some of the experiments that are being done to show the capabilities and limitations of modern Deep Reinforcement Learning approaches, including those of sparse rewards and continuous observation and action spaces. An in-depth explanation will be given of how Hindsight Experience Replay (HER) has been used to obtain dense results from sparse environments when using Deep Deterministic Policy Gradient (DDPG) agents. I will then show how we have modified some of these experiments to gain a deeper understanding of the intelligence we are developing, and which baseline environmental characteristics make the robots achieve higher levels of creativity in their problem-solving scenarios.
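The core idea of Hindsight Experience Replay can be sketched in a few lines: transitions from a failed episode are stored a second time with the goal replaced by a state the agent actually reached, so that at least some stored transitions carry a non-failure reward. The sketch below assumes the simple "final state as goal" relabeling strategy, a reward of 0 on success and -1 otherwise, and a hypothetical tuple layout; none of these details come from the seminar abstract.

```python
def sparse_reward(achieved, goal):
    """0 when the goal is reached, -1 otherwise (a common sparse-reward convention)."""
    return 0.0 if achieved == goal else -1.0


def relabel_with_final_goal(episode):
    """Relabel an episode as if its final reached state had been the goal.

    episode: list of (state, action, next_state, goal) tuples (hypothetical layout).
    Returns (state, action, reward, next_state, new_goal) tuples that can be
    added to a replay buffer alongside the original transitions.
    """
    if not episode:
        return []
    new_goal = episode[-1][2]  # the state actually reached at the end
    relabeled = []
    for state, action, next_state, _original_goal in episode:
        reward = sparse_reward(next_state, new_goal)
        relabeled.append((state, action, reward, next_state, new_goal))
    return relabeled


# A 1-D toy episode that aimed for position 5 but only reached position 2.
episode = [(0, +1, 1, 5), (1, +1, 2, 5)]
for transition in relabel_with_final_goal(episode):
    print(transition)
```

An off-policy learner such as DDPG can train on these relabeled transitions because its update does not require that the data were generated while pursuing the relabeled goal.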
Randomised Bayesian Least-Squares Policy Iteration
Tziortziotis, Nikolaos, Dimitrakakis, Christos, Vazirgiannis, Michalis
We introduce Bayesian least-squares policy iteration (BLSPI), an off-policy, model-free, policy iteration algorithm that uses the Bayesian least-squares temporal-difference (BLSTD) learning algorithm to evaluate policies. An online variant of BLSPI, called randomised BLSPI (RBLSPI), is also proposed; it improves its policy based on an incomplete policy evaluation step. In the online setting, the exploration-exploitation dilemma must be addressed, as we try to discover the optimal policy using samples that we collect ourselves. RBLSPI exploits the ability of BLSTD to quantify our uncertainty about the value function. Inspired by Thompson sampling, RBLSPI first samples a value function from a posterior distribution over value functions, and then selects actions based on the sampled value function. The effectiveness and the exploration abilities of RBLSPI are demonstrated experimentally in several environments.
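The Thompson-sampling step described above can be sketched as follows, assuming a linear value function whose weights have an independent Gaussian posterior. The diagonal posterior, the feature vectors, and all names are simplifying assumptions made for illustration; the abstract does not give the form of the BLSTD posterior, and this is not the authors' algorithm.

```python
import random


def sample_weights(mean, std, rng):
    """Draw one weight vector from an independent Gaussian posterior (assumed form)."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def q_value(weights, features):
    """Linear value estimate: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def thompson_action(mean, std, features_per_action, rng):
    """Sample one value function, then act greedily with respect to it."""
    weights = sample_weights(mean, std, rng)
    values = [q_value(weights, phi) for phi in features_per_action]
    return max(range(len(values)), key=values.__getitem__)


rng = random.Random(0)
features = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical one-hot features for two actions
# With a wide posterior, both actions get tried; as it narrows, choices settle.
counts = [0, 0]
for _ in range(200):
    counts[thompson_action([0.0, 0.0], [1.0, 1.0], features, rng)] += 1
print(counts)
```

Exploration here comes entirely from posterior uncertainty: once the standard deviations shrink toward zero, the sampled value function matches the posterior mean and the policy becomes greedy.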
Personalized Cancer Chemotherapy Schedule: a numerical comparison of performance and robustness in model-based and model-free scheduling methodologies
Tordesillas, Jesus, Arbelaiz, Juncal
Reinforcement learning algorithms are gaining popularity in fields where optimal scheduling is important, and oncology is no exception. The complex and uncertain dynamics of cancer limit the performance of traditional model-based scheduling strategies like Optimal Control. Motivated by the recent success of model-free Deep Reinforcement Learning (DRL) in challenging control tasks and in medical treatments, we use Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG) to design a personalized cancer chemotherapy schedule. We show that both of them succeed in the task and outperform the Optimal Control solution in the presence of uncertainty. Furthermore, we show that DDPG can exterminate cancer more efficiently than DQN due to its continuous action space. Finally, we provide some intuition regarding the number of samples required for training.
Reinforced Imitation in Heterogeneous Action Space
Zolna, Konrad, Rostamzadeh, Negar, Bengio, Yoshua, Ahn, Sungjin, Pinheiro, Pedro O.
Imitation learning is an effective alternative approach to learn a policy when the reward function is sparse. In this paper, we consider a challenging setting where an agent and an expert use different actions from each other. We assume that the agent has access to a sparse reward function and state-only expert observations. We propose a method which gradually balances between the imitation learning cost and the reinforcement learning objective. In addition, this method adapts the agent's policy based on either mimicking expert behavior or maximizing sparse reward. We show, through navigation scenarios, that (i) an agent is able to efficiently leverage sparse rewards to outperform standard state-only imitation learning, (ii) it can learn a policy even when its actions are different from the expert, and (iii) the performance of the agent is not bounded by that of the expert, due to the optimized usage of sparse rewards.
Reinforcement Learning with Attention that Works: A Self-Supervised Approach
Manchin, Anthony, Abbasnejad, Ehsan, Hengel, Anton van den
Attention models have had a significant positive impact on deep learning across a range of tasks. However, previous attempts at integrating attention with reinforcement learning have failed to produce significant improvements. We propose the first combination of self-attention and reinforcement learning that is capable of producing significant improvements, including new state-of-the-art results in the Arcade Learning Environment. Unlike the selective attention models used in previous attempts, which constrain the attention via preconceived notions of importance, our implementation utilises the Markovian properties inherent in the state input. Our method produces a faithful visualisation of the policy, focusing on the behaviour of the agent. Our experiments demonstrate that the trained policies use multiple simultaneous foci of attention, and are able to modulate attention over time to deal with situations of partial observability.
Monte Carlo Neural Fictitious Self-Play: Approach to Approximate Nash equilibrium of Imperfect-Information Games
Zhang, Li, Wang, Wei, Li, Shijian, Pan, Gang
Researchers in artificial intelligence have achieved human-level intelligence in large-scale perfect-information games, but it remains a challenge to achieve (nearly) optimal results (in other words, an approximate Nash equilibrium) in large-scale imperfect-information games (e.g., war games, football coaching, or business strategies). Neural Fictitious Self-Play (NFSP) is an effective algorithm for learning an approximate Nash equilibrium of imperfect-information games from self-play without prior domain knowledge. However, it relies on a Deep Q-Network, which is off-line and hard to converge in online games with changing opponent strategies, so it cannot approach an approximate Nash equilibrium in games with a large search scale and deep search depth. In this paper, we propose Monte Carlo Neural Fictitious Self-Play (MC-NFSP), an algorithm that combines Monte Carlo tree search with NFSP and greatly improves performance on large-scale zero-sum imperfect-information games. Experimentally, we demonstrate that MC-NFSP can converge to an approximate Nash equilibrium in games with large-scale search depth, while NFSP cannot. Furthermore, we develop Asynchronous Neural Fictitious Self-Play (ANFSP), which uses an asynchronous and parallel architecture to collect game experience. In experiments, we show that parallel actor-learners further accelerate and stabilize training.