Reinforcement Learning
Adversarial Imitation Learning via Random Search
Shin, MyungJae, Kim, Joongheon
Developing agents that can perform challenging complex tasks is the goal of reinforcement learning. The model-free reinforcement learning has been considered as a feasible solution. However, the state of the art research has been to develop increasingly complicated techniques. This increasing complexity makes the reconstruction difficult. Furthermore, the problem of reward dependency is still exists. As a result, research on imitation learning, which learns policy from a demonstration of experts, has begun to attract attention. Imitation learning directly learns policy based on data on the behavior of the experts without the explicit reward signal provided by the environment. However, imitation learning tries to optimize policies based on deep reinforcement learning such as trust region policy optimization. As a result, deep reinforcement learning based imitation learning also poses a crisis of reproducibility. The issue of complex model-free model has received considerable critical attention. A derivative-free optimization based reinforcement learning and the simplification on policies obtain competitive performance on the dynamic complex tasks. The simplified policies and derivative free methods make algorithm be simple. The reconfiguration of research demo becomes easy. In this paper, we propose an imitation learning method that takes advantage of the derivative-free optimization with simple linear policies. The proposed method performs simple random search in the parameter space of policies and shows computational efficiency. Experiments in this paper show that the proposed model, without a direct reward signal from the environment, obtains competitive performance on the MuJoCo locomotion tasks.
Audio-Visual Waypoints for Navigation
Chen, Changan, Majumder, Sagnik, Al-Halah, Ziad, Gao, Ruohan, Ramakrishnan, Santhosh Kumar, Grauman, Kristen
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements 1) audio-visual waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on the challenging Replica environments of real-world 3D scenes. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation.
Multi-Agent Reinforcement Learning with Graph Clustering
Zhou, Tianze, Zhang, Fubiao, Wang, Chenfei
In this paper, we introduce the group concept into multi-agent reinforcement learning. In this method, agents are divided into several groups and each group completes a specific subtask so that agents can cooperate to complete the main task. Existing methods use the communication vector to exchange information between agents. This may encounter communication redundancy. To solve this problem, we propose a MARL method based on graph clustering. It allows agents to adaptively learn group features and replaces the communication operation. In our method, agent features are divide into two types, including in-group features and individual features. They represent the generality and differences between agents, respectively. Based on the graph attention network(GAT), we introduce the graph clustering method as a punishment to optimize agent group feature. Then these features are used to generate individual Q value. To overcome the consistent problem brought by GAT, we introduce the split loss to distinguish agent features. Our method is easy to convert into the CTDE framework via using Kullback-Leibler divergence method. Empirical results are evaluated on a challenging set of StarCraft II micromanagement tasks. The result shows that our method outperforms existing multi-agent reinforcement learning methods and the performance increases with the number of agents increasing.
Expressing Diverse Human Driving Behavior with Probabilistic Rewards and Online Inference
Sun, Liting, Wu, Zheng, Ma, Hengbo, Tomizuka, Masayoshi
In human-robot interaction (HRI) systems, such as autonomous vehicles, understanding and representing human behavior are important. Human behavior is naturally rich and diverse. Cost/reward learning, as an efficient way to learn and represent human behavior, has been successfully applied in many domains. Most of traditional inverse reinforcement learning (IRL) algorithms, however, cannot adequately capture the diversity of human behavior since they assume that all behavior in a given dataset is generated by a single cost function.In this paper, we propose a probabilistic IRL framework that directly learns a distribution of cost functions in continuous domain. Evaluations on both synthetic data and real human driving data are conducted. Both the quantitative and subjective results show that our proposed framework can better express diverse human driving behaviors, as well as extracting different driving styles that match what human participants interpret in our user study.
A Composable Specification Language for Reinforcement Learning Tasks
Jothimurugan, Kishor, Alur, Rajeev, Bastani, Osbert
Reinforcement learning is a promising approach for learning control policies for robot tasks. However, specifying complex tasks (e.g., with multiple objectives and safety constraints) can be challenging, since the user must design a reward function that encodes the entire task. Furthermore, the user often needs to manually shape the reward to ensure convergence of the learning algorithm. We propose a language for specifying complex control tasks, along with an algorithm that compiles specifications in our language into a reward function and automatically performs reward shaping. We implement our approach in a tool called SPECTRL, and show that it outperforms several state-of-the-art baselines.
Explainability in Deep Reinforcement Learning
Heuillet, Alexandre, Couthouis, Fabien, Dรญaz-Rodrรญguez, Natalia
During the past decade, Artificial Intelligence (AI), and by extension Machine Learning (ML), have seen an unprecedented rise in both industry and research. The progressive improvement of computer hardware associated with the need to process larger and larger amounts of data made these underestimated techniques shine under a new light. Reinforcement Learning (RL) focuses on learning how to map situations to actions, in order to maximize a numerical reward signal [102]. The learner is not told which actions to take, but instead must discover which actions are the most rewarding by trying them. Reinforcement learning addresses the problem of how agents should learn a policy that take actions to maximize the cumulative reward through interaction with the environment [31]. Recent progress in Deep Learning (DL) for learning feature representations has significantly impacted RL, and the combination of both methods (known as deep RL) has led to remarkable results in a lot of areas. Typically, RL is used to solve optimisation problems when the system has a very large number of states and has a complex stochastic structure. Notable examples include training agents to play Atari games based on raw pixels [75, 76], board games [96, 97], complex real-world robotics problems such as manipulation [8] or grasping [54] and other real-world applications such as resource management in computer clusters [72], network traffic signal control [9], chemical reactions optimization [117] or recommendation systems [116].
Imitation Learning with Sinkhorn Distances
Papagiannis, Georgios, Li, Yunpeng
Imitation learning algorithms have been interpreted as variants of divergence minimization problems. The ability to compare occupancy measures between experts and learners is crucial in their effectiveness in learning from demonstrations. In this paper, we present tractable solutions by formulating imitation learning as minimization of the Sinkhorn distance between occupancy measures. The formulation combines the valuable properties of optimal transport metrics in comparing non-overlapping distributions with a cosine distance cost defined in an adversarially learned feature space. This leads to a highly discriminative critic network and optimal transport plan that subsequently guide imitation learning. We evaluate the proposed approach using both the reward metric and the Sinkhorn distance metric on a number of MuJoCo experiments.
Reinforcement Learning based dynamic weighing of Ensemble Models for Time Series Forecasting
Perepu, Satheesh K., Balaji, Bala Shyamala, Tanneru, Hemanth Kumar, Kathari, Sudhakar, Pinnamaraju, Vivek Shankar
Ensemble models are powerful model building tools that are developed with a focus to improve the accuracy of model predictions. They find applications in time series forecasting in varied scenarios including but not limited to process industries, health care, and economics where a single model might not provide optimal performance. It is known that if models selected for data modelling are distinct (linear/non-linear, static/dynamic) and independent (minimally correlated models), the accuracy of the predictions is improved. Various approaches suggested in the literature to weigh the ensemble models use a static set of weights. Due to this limitation, approaches using a static set of weights for weighing ensemble models cannot capture the dynamic changes or local features of the data effectively. To address this issue, a Reinforcement Learning (RL) approach to dynamically assign and update weights of each of the models at different time instants depending on the nature of data and the individual model predictions is proposed in this work. The RL method implemented online, essentially learns to update the weights and reduce the errors as the time progresses. Simulation studies on time series data showed that the dynamic weighted approach using RL learns the weight better than existing approaches. The accuracy of the proposed method is compared with an existing approach of online Neural Network tuning quantitatively through normalized mean square error(NMSE) values.
Optimization of operation parameters towards sustainable WWTP based on deep reinforcement learning
Chena, Kehua, Wang, Hongcheng, Perezc, Borja Valverde, Vezzaro, Luca, Wang, Aijie
A large amount of wastewater has been produced nowadays. Wastewater treatment plants (WWTPs) are designed to eliminate pollutants and alleviate environmental pollution resulting from human activities. However, the construction and operation of WWTPs still have negative impacts. WWTPs are complex to control and optimize because of high nonlinearity and variation. This study used a novel technique, multi-agent deep reinforcement learning (DRL), to optimize dissolved oxygen (DO) and dosage in a hypothetical WWTP. The reward function is specially designed as LCA-based form to achieve sustainability optimization. Four scenarios: baseline, LCA-oriented, cost-oriented and effluent-oriented are considered. The result shows that optimization based on LCA has lowest environmental impacts. The comparison of different SRT indicates that a proper SRT can reduce negative impacts greatly. It is worth mentioning that the retrofitting of WWTPs should be implemented with the consideration of other environmental impacts except cost. Moreover, the comparison between DRL and genetic algorithm (GA) indicates that DRL can solve optimization problems effectively and has great extendibility. In a nutshell, there are still limits and shortcomings of this work, future studies are required.