Reinforcement Learning
Mobile Cellular-Connected UAVs: Reinforcement Learning for Sky Limits
Azari, M. Mahdi, Arani, Atefeh Hajijamali, Rosas, Fernando
A cellular-connected unmanned aerial vehicle (UAV)faces several key challenges concerning connectivity and energy efficiency. Through a learning-based strategy, we propose a general novel multi-armed bandit (MAB) algorithm to reduce disconnectivity time, handover rate, and energy consumption of UAV by taking into account its time of task completion. By formulating the problem as a function of UAV's velocity, we show how each of these performance indicators (PIs) is improved by adopting a proper range of corresponding learning parameter, e.g. 50% reduction in HO rate as compared to a blind strategy. However, results reveal that the optimal combination of the learning parameters depends critically on any specific application and the weights of PIs on the final objective function.
Option Discovery in Hierarchical Reinforcement Learning using Spatio-Temporal Clustering
Srinivas, Aravind, Krishnamurthy, Ramnandan, Kumar, Peeyush, Ravindran, Balaraman
This paper introduces an automated skill acquisition framework in reinforcement learning which involves identifying a hierarchical description of the given task in terms of abstract states and extended actions between abstract states. Identifying such structures present in the task provides ways to simplify and speed up reinforcement learning algorithms. These structures also help to generalize such algorithms over multiple tasks without relearning policies from scratch. We use ideas from dynamical systems to find metastable regions in the state space and associate them with abstract states. The spectral clustering algorithm PCCA+ is used to identify suitable abstractions aligned to the underlying structure. Skills are defined in terms of the sequence of actions that lead to transitions between such abstract states. The connectivity information from PCCA+ is used to generate these skills or options. These skills are independent of the learning task and can be efficiently reused across a variety of tasks defined over the same model. This approach works well even without the exact model of the environment by using sample trajectories to construct an approximate estimate. We also present our approach to scaling the skill acquisition framework to complex tasks with large state spaces for which we perform state aggregation using the representation learned from an action conditional video prediction network and use the skill acquisition framework on the aggregated state space.
Double Prioritized State Recycled Experience Replay
Experience replay enables online reinforcement learning agents to store and reuse the previous experiences of interacting with the environment. In the original method, the experiences are sampled and replayed uniformly at random. A prior work called prioritized experience replay was developed where experiences are prioritized, so as to replay experiences seeming to be more important more frequently. In this paper, we develop a method called double-prioritized state-recycled (DPSR) experience replay, prioritizing the experiences in both training stage and storing stage, as well as replacing the experiences in the memory with state recycling to make the best of experiences that seem to have low priorities temporarily. We used this method in Deep Q-Networks (DQN), and achieved a state-of-the-art result, outperforming the original method and prioritized experience replay on many Atari games.
Optimizing for the Future in Non-Stationary MDPs
Chandak, Yash, Theocharous, Georgios, Shankar, Shiv, White, Martha, Mahadevan, Sridhar, Thomas, Philip S.
Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process is stationary. However, in many real-world applications, this assumption is violated, and using existing algorithms may result in a performance lag. To proactively search for a good future policy, we present a policy gradient algorithm that maximizes a forecast of future performance. This forecast is obtained by fitting a curve to the counter-factual estimates of policy performance over time, without explicitly modeling the underlying non-stationarity. The resulting algorithm amounts to a non-uniform reweighting of past data, and we observe that minimizing performance over some of the data from past episodes can be beneficial when searching for a policy that maximizes future performance. We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques, on three simulated problems motivated by real-world applications.
Aligning AI With Shared Human Values
Hendrycks, Dan, Burns, Collin, Basart, Steven, Critch, Andrew, Li, Jerry, Song, Dawn, Steinhardt, Jacob
We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete understanding of basic ethical knowledge. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
[D] Quality Contributions Roundup 9/14
Though the community continues to develop new algorithms, state-of-the-art results have stopped improving in the last couple of years. Since RL algorithms that use a tremendous amount of online data to learn from scratch are infeasible to apply in the real-world, much research has moved to fields such as Meta-RL, offline RL, and integrating RL with domain-knowledge, integrating RL and planning, etc. How do you unit test end-to-end ML pipelines?, by u/farmingvillein As perhaps a bit of tldr: once you've got the bare minimum data-replay testing in place ("yeah, it is probably working, because the results are pretty close to what they were before"), I'd encourage you to consider focusing your energy toward thinking of testing as outlier detection. Outliers, in real-world ML systems, tend to be harbingers of things that are wrong systematically, upstream data problems, and logic (pre-/post-processing) problems. How do you transition from a no name international college to FAIR/Brain?, by u/r-sync Coming from a no-name Indian engineering college with meh grades, you do have to get a bit creative, very persistent and build credibility for yourself. The examples above are one way to do so, but you can also maybe articulate your thoughts as really good blog posts and arxiv papers, or show great software engineering skills in open-source (i.e.
Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms
EXP-based algorithms are often used for exploration in multi-armed bandit. We revisit the EXP3.P algorithm and establish both the lower and upper bounds of regret in the Gaussian multi-armed bandit setting, as well as a more general distribution option. The analyses do not require bounded rewards compared to classical regret assumptions. We also extend EXP4 from multi-armed bandit to reinforcement learning to incentivize exploration by multiple agents. The resulting algorithm has been tested on hard-to-explore games and it shows an improvement on exploration compared to state-of-the-art.
RL STaR Platform: Reinforcement Learning for Simulation based Training of Robots
Blum, Tamir, Paillet, Gabin, Laine, Mickael, Yoshida, Kazuya
Reinforcement learning (RL) is a promising field to enhance robotic autonomy and decision making capabilities for space robotics, something which is challenging with traditional techniques due to stochasticity and uncertainty within the environment. RL can be used to enable lunar cave exploration with infrequent human feedback, faster and safer lunar surface locomotion or the coordination and collaboration of multi-robot systems. However, there are many hurdles making research challenging for space robotic applications using RL and machine learning, particularly due to insufficient resources for traditional robotics simulators like CoppeliaSim. Our solution to this is an open source modular platform called Reinforcement Learning for Simulation based Training of Robots, or RL STaR, that helps to simplify and accelerate the application of RL to the space robotics research field. This paper introduces the RL STaR platform, and how researchers can use it through a demonstration.
Latent Representation Prediction Networks
Hlynsson, Hlynur Davíð, Schüler, Merlin, Schiewer, Robin, Glasmachers, Tobias, Wiskott, Laurenz
Deeply-learned planning methods are often based on learning representations that are optimized for unrelated tasks. For example, they might be trained on reconstructing the environment. These representations are then combined with predictor functions for simulating rollouts to navigate the environment. We find this principle of learning representations unsatisfying and propose to learn them such that they are directly optimized for the task at hand: to be maximally predictable for the predictor function. This results in representations that are by design optimal for the downstream task of planning, where the learned predictor function is used as a forward model. To this end, we propose a new way of jointly learning this representation along with the prediction function, a system we dub Latent Representation Prediction Network (LARP). The prediction function is used as a forward model for search on a graph in a viewpoint-matching task and the representation learned to maximize predictability is found to outperform a pre-trained representation. Our approach is shown to be more sample-efficient than standard reinforcement learning methods and our learned representation transfers successfully to dissimilar objects.
FlexPool: A Distributed Model-Free Deep Reinforcement Learning Algorithm for Joint Passengers & Goods Transportation
Manchella, Kaushik, Umrawal, Abhishek K., Aggarwal, Vaneet
The growth in online goods delivery is causing a dramatic surge in urban vehicle traffic from last-mile deliveries. On the other hand, ride-sharing has been on the rise with the success of ride-sharing platforms and increased research on using autonomous vehicle technologies for routing and matching. The future of urban mobility for passengers and goods relies on leveraging new methods that minimize operational costs and environmental footprints of transportation systems. This paper considers combining passenger transportation with goods delivery to improve vehicle-based transportation. Even though the problem has been studied with a defined dynamics model of the transportation system environment, this paper considers a model-free approach that has been demonstrated to be adaptable to new or erratic environment dynamics. We propose FlexPool, a distributed model-free deep reinforcement learning algorithm that jointly serves passengers & goods workloads by learning optimal dispatch policies from its interaction with the environment. The proposed algorithm pools passengers for a ride-sharing service and delivers goods using a multi-hop transit method. These flexibilities decrease the fleet's operational cost and environmental footprint while maintaining service levels for passengers and goods. Through simulations on a realistic multi-agent urban mobility platform, we demonstrate that FlexPool outperforms other model-free settings in serving the demands from passengers & goods. FlexPool achieves 30% higher fleet utilization and 35% higher fuel efficiency in comparison to (i) model-free approaches where vehicles transport a combination of passengers & goods without the use of multi-hop transit, and (ii) model-free approaches where vehicles exclusively transport either passengers or goods.