Reinforcement Learning
UAV Path Planning for Wireless Data Harvesting: A Deep Reinforcement Learning Approach
Bayerlein, Harald, Theile, Mirco, Caccamo, Marco, Gesbert, David
Autonomous deployment of unmanned aerial vehicles (UAVs) supporting next-generation communication networks requires efficient trajectory planning methods. We propose a new end-to-end reinforcement learning (RL) approach to UAV-enabled data collection from Internet of Things (IoT) devices in an urban environment. An autonomous drone is tasked with gathering data from distributed sensor nodes subject to limited flying time and obstacle avoidance. While previous approaches, learning and non-learning based, must perform expensive recomputations or relearn a behavior when important scenario parameters, such as the number of sensors, sensor positions, or maximum flying time, change, we train a double deep Q-network (DDQN) with combined experience replay to learn a UAV control policy that generalizes over changing scenario parameters. By exploiting a multi-layer map of the environment fed through convolutional network layers to the agent, we show that our proposed network architecture enables the agent to make movement decisions for a variety of scenario parameters that balance the data collection goal with flight time efficiency and safety constraints. Considerable advantages in learning efficiency from using a map centered on the UAV's position over a non-centered map are also illustrated.
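The abstract above names a double deep Q-network (DDQN) but gives no update rule. As a rough illustration only, the sketch below shows the standard Double DQN bootstrap target for one transition: the online network picks the next action and the target network scores it. The function name `ddqn_target`, the plain-list Q-values, and the discount `gamma=0.95` are assumptions made for this example, not details taken from the paper.

```python
from typing import Sequence


def ddqn_target(reward: float,
                done: bool,
                q_online_next: Sequence[float],
                q_target_next: Sequence[float],
                gamma: float = 0.95) -> float:
    """Standard Double DQN target for a single transition (illustrative sketch).

    q_online_next / q_target_next hold the Q-values of every action in the
    next state, as estimated by the online and target networks respectively.
    """
    if done:
        # No bootstrapping from a terminal state.
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... while action evaluation uses the target network.
    return reward + gamma * q_target_next[best_action]
```

Decoupling selection from evaluation is what distinguishes this target from the plain DQN target, which would take the maximum over `q_target_next` directly.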
Refactoring Policy for Compositional Generalizability using Self-Supervised Object Proposals
Mu, Tongzhou, Gu, Jiayuan, Jia, Zhiwei, Tang, Hao, Su, Hao
We study how to learn a policy with compositional generalizability. We propose a two-stage framework, which refactorizes a high-reward teacher policy into a generalizable student policy with strong inductive bias. Particularly, we implement an object-centric GNN-based student policy, whose input objects are learned from images through self-supervised learning. Empirically, we evaluate our approach on four difficult tasks that require compositional generalizability, and achieve superior performance compared to baselines.
Graph-based Reinforcement Learning for Active Learning in Real Time: An Application in Modeling River Networks
Jia, Xiaowei, Lin, Beiyu, Zwart, Jacob, Sadler, Jeffery, Appling, Alison, Oliver, Samantha, Read, Jordan
Effective training of advanced ML models requires large amounts of labeled data, which is often scarce in scientific problems given the substantial human labor and material cost to collect labeled data. This poses a challenge in determining when and where we should deploy measuring instruments (e.g., in-situ sensors) to collect labeled data efficiently. This problem differs from traditional pool-based active learning settings in that the labeling decisions have to be made immediately after we observe the input data, which arrive as a time series. In this paper, we develop a real-time active learning method that uses the spatial and temporal contextual information to select representative query samples in a reinforcement learning framework. To reduce the need for large training data, we further propose to transfer the policy learned from simulation data generated by existing physics-based models. We demonstrate the effectiveness of the proposed method by predicting streamflow and water temperature in the Delaware River Basin given a limited budget for collecting labeled data. We further study the spatial and temporal distribution of selected samples to verify the ability of this method to select informative samples over space and time.
MELD: Meta-Reinforcement Learning from Images via Latent State Models
Zhao, Tony Z., Nagabandi, Anusha, Rakelly, Kate, Finn, Chelsea, Levine, Sergey
Meta-reinforcement learning algorithms can enable autonomous agents, such as robots, to quickly acquire new behaviors by leveraging prior experience in a set of related training tasks. However, the onerous data requirements of meta-training compounded with the challenge of learning from sensory inputs such as images have made meta-RL challenging to apply to real robotic systems. Latent state models, which learn compact state representations from a sequence of observations, can accelerate representation learning from visual inputs. In this paper, we leverage the perspective of meta-learning as task inference to show that latent state models can also perform meta-learning given an appropriately defined observation space. Building on this insight, we develop meta-RL with latent dynamics (MELD), an algorithm for meta-RL from images that performs inference in a latent state model to quickly acquire new skills given observations and rewards. MELD outperforms prior meta-RL methods on several simulated image-based robotic control problems, and enables a real WidowX robotic arm to insert an Ethernet cable into new locations given a sparse task completion signal after only 8 hours of real world meta-training. To our knowledge, MELD is the first meta-RL algorithm trained in a real-world robotic control setting from images.
Forethought and Hindsight in Credit Assignment
Chelu, Veronica, Precup, Doina, van Hasselt, Hado
We address the problem of credit assignment in reinforcement learning and explore fundamental questions regarding the way in which an agent can best use additional computation to propagate new information, by planning with internal models of the world to improve its predictions. Particularly, we work to understand the gains and peculiarities of planning employed as forethought via forward models or as hindsight operating with backward models. We establish the relative merits, limitations and complementary properties of both planning mechanisms in carefully constructed scenarios. Further, we investigate the best use of models in planning, primarily focusing on the selection of states in which predictions should be (re)-evaluated. Lastly, we discuss the issue of model estimation and highlight a spectrum of methods that stretch from explicit environment-dynamics predictors to more abstract planner-aware models.
Energy and Service-priority aware Trajectory Design for UAV-BSs using Double Q-Learning
Hoseini, Sayed Amir, Bokani, Ayub, Hassan, Jahan, Salehi, Shavbo, Kanhere, Salil S.
Next-generation mobile networks have proposed the integration of Unmanned Aerial Vehicles (UAVs) as aerial base stations (UAV-BSs) to serve ground nodes. Despite the advantages of UAV-BSs, their dependence on a limited-capacity on-board battery hinders their service continuity. Shorter trajectories can save flying energy; however, UAV-BSs must also serve nodes based on their service priority, since nodes' service requirements are not always the same. In this paper, we present an energy-efficient trajectory optimization for a UAV-assisted IoT system in which the UAV-BS considers the IoT nodes' service priorities in making its movement decisions. We solve the trajectory optimization problem using the Double Q-Learning algorithm. Simulation results reveal that the Q-Learning based optimized trajectory outperforms a benchmark algorithm, namely the Greedily-served algorithm, in terms of reducing the average energy consumption of the UAV-BS as well as the service delay for high priority nodes.
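The abstract above solves its trajectory problem with Double Q-Learning. For readers unfamiliar with the method, here is a minimal sketch of one tabular Double Q-learning step in its textbook form, with two value tables that take turns learning and evaluating. The function name `double_q_update`, the dictionary-based tables, and the default `alpha` and `gamma` are assumptions for illustration; the paper's state, action, and reward design is not reproduced here.

```python
import random
from collections import defaultdict


def double_q_update(q_a, q_b, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.9):
    """One tabular Double Q-learning step; updates q_a or q_b in place.

    q_a and q_b map (state, action) pairs to value estimates, e.g.
    defaultdict(float). `actions` lists the actions available in next_state.
    """
    # Flip a fair coin to decide which table learns on this step;
    # the other table is used only to evaluate the chosen greedy action.
    if random.random() < 0.5:
        learn, evaluate = q_a, q_b
    else:
        learn, evaluate = q_b, q_a
    # Greedy action according to the table being updated ...
    best = max(actions, key=lambda a: learn[(next_state, a)])
    # ... valued by the other table, which reduces overestimation bias.
    target = reward + gamma * evaluate[(next_state, best)]
    learn[(state, action)] += alpha * (target - learn[(state, action)])


# Hypothetical usage with two empty tables and made-up state names.
q_a, q_b = defaultdict(float), defaultdict(float)
double_q_update(q_a, q_b, "cell_0", "east", 1.0, "cell_1", ["east", "west"])
```

When acting, a common choice is to pick actions greedily with respect to the sum (or average) of the two tables.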
Understanding Reinforcement Learning Hands-On: The Bellman Equation pt.1
Welcome to the fifth entry in a series on Reinforcement Learning. In the previous article, we presented the MDP Framework for describing complex environments. This allowed us to create a more robust and diverse scenario for the basic Multi-Armed Bandits problem, which we called the Casinos Environment. We then implemented this scenario using OpenAI's gym, and made a simple agent that acted randomly to showcase how an interaction is realized under the MDP Framework. Today, we're going to focus back on the agents, and show a way in which we can describe an agent's behavior in complex scenarios, where past actions determine future rewards.
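The teaser above stops before stating the Bellman equation itself. As a hedged illustration of what it expresses, the sketch below repeatedly applies the Bellman expectation backup, v(s) = sum over a of pi(a|s) * sum over outcomes of p * (r + gamma * v(s')), to a small hand-written MDP. The function name `evaluate_policy`, the nested-dictionary layout, and the toy example are assumptions for this example and are not taken from the article or its Casinos Environment.

```python
def evaluate_policy(transitions, policy, gamma=0.9, sweeps=200):
    """Iterative policy evaluation with the Bellman expectation backup.

    transitions[s][a] is a list of (probability, next_state, reward) tuples;
    policy[s][a] is the probability of choosing action a in state s.
    Every next_state must itself be a key of `transitions`.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        # Each sweep replaces v(s) by its one-step expected return under the policy.
        values = {
            s: sum(
                policy[s][a] * sum(p * (r + gamma * values[s2])
                                   for p, s2, r in outcomes)
                for a, outcomes in transitions[s].items()
            )
            for s in transitions
        }
    return values


# Toy MDP: "start" moves to "end" for reward 1; "end" loops forever with reward 0.
toy_transitions = {
    "start": {"go": [(1.0, "end", 1.0)]},
    "end": {"stay": [(1.0, "end", 0.0)]},
}
toy_policy = {"start": {"go": 1.0}, "end": {"stay": 1.0}}
toy_values = evaluate_policy(toy_transitions, toy_policy, gamma=0.5)
```

In the toy example the value of "start" is the single reward collected on the way to the absorbing state, and the value of "end" stays at zero.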
Can Animals Help Us Build Better AI?
Machine learning has been making plenty of headlines in the past few years. Rightfully so, even though headlines tend to oversell. Advances in computing power, algorithmic complexity, data handling capacities, and models of learning mean that machine learning/AI is increasingly being used in many fields. In previous posts, I have written about machine learning/AI in general science and art, but also more specifically in (warning, link fest) historical research, genetic enhancement, mental health, aging research (including the development of 'aging clocks'), video game ecology, Hollywood, astrobiology, epidemiology, stock markets, and the job market. Plenty of AI to go around, it seems.
Robust Hierarchical Planning with Policy Delegation
We propose a novel framework and algorithm for hierarchical planning based on the principle of delegation. This framework, the Markov Intent Process, features a collection of skills which are each specialised to perform a single task well. Skills are aware of their intended effects and are able to analyse planning goals to delegate planning to the best-suited skill. This principle dynamically creates a hierarchy of plans, in which each skill plans for sub-goals for which it is specialised. The proposed planning method features on-demand execution: skill policies are only evaluated when needed. Plans are only generated at the highest level, then expanded and optimised when the latest state information is available. The high-level plan retains the initial planning intent and previously computed skills, effectively reducing the computation needed to adapt to environmental changes. We show experimentally that this planning approach is very competitive with classic planning and reinforcement learning techniques on a variety of domains, both in terms of solution length and planning time.
Model-based Reinforcement Learning for Semi-Markov Decision Processes with Neural ODEs
Du, Jianzhun, Futoma, Joseph, Doshi-Velez, Finale
We present two elegant solutions for modeling continuous-time dynamics in a novel model-based reinforcement learning (RL) framework for semi-Markov decision processes (SMDPs), using neural ordinary differential equations (ODEs). Our models accurately characterize continuous-time dynamics and enable us to develop high-performing policies using a small amount of data. We also develop a model-based approach for optimizing time schedules to reduce interaction rates with the environment while maintaining near-optimal performance, which is not possible for model-free methods. We experimentally demonstrate the efficacy of our methods across various continuous-time domains.