Reinforcement Learning
Taming an autonomous surface vehicle for path following and collision avoidance using deep reinforcement learning
Meyer, Eivind, Robinson, Haakon, Rasheed, Adil, San, Omer
Eivind Meyer is currently working on his Master's thesis, completing his five-year integrated Master's degree in Cybernetics and Robotics at the Norwegian University of Science and Technology (NTNU) in Trondheim. He has specialized in Real Time Systems, and his research interests focus on adopting state-of-the-art Artificial Intelligence methods for Autonomous Vehicle Control. Haakon Robinson is a PhD candidate at the Norwegian University of Science and Technology (NTNU). He received a Bachelor's degree in Physics in 2015 and completed a Master's degree in Cybernetics and Robotics in 2019, both at NTNU. His current work investigates the overlap between modern machine learning techniques and established methods within modelling and control, with a focus on improving the interpretability and behavioural guarantees of hybrid models that combine first principle models and data-driven components. (E. Meyer et al.: Preprint submitted to Elsevier.)
AI experts urge machine learning researchers to tackle climate change
At the Tackling Climate Change workshop at this year's NeurIPS conference, some of the top minds in machine learning came together to discuss the effects of climate change on life on Earth, how AI can tackle the urgent problem, and why and how the machine learning community should join the fight. The panel included Yoshua Bengio, MILA director and University of Montreal professor; Jeff Dean, Google's AI chief; Andrew Ng, cofounder of Google Brain and founder of Landing.ai; and Cornell University professor and Institute for Computational Sustainability director Carla Gomes. The Tackling Climate Change workshop explored a wide range of topics, from the use of deep reinforcement learning to improve performance for ride-hailing services like Uber and Lyft to the application of deep learning to predict wildfire risk, detect avalanche deposits, improve plane efficiency with better wind forecasts, and conduct a global census of solar farms. The workshop is put together by Climate Change AI, a group that hosts workshops at AI research conferences and a forum for collaboration between machine learning practitioners and people from other fields. One essential step in better addressing the world's pressing challenges, says Bengio, is changing the way AI research is valued.
Rule of thumb: Which AI / ML algorithms to apply to business problems
Supervised learning: You know how to classify the input data and the type of behavior you want to predict, but you need the algorithm to calculate it for you on new data.
Unsupervised learning: You do not know how to classify the data, and you want the algorithm to find patterns and classify the data for you.
Reinforcement learning: An algorithm which learns by trial and error by interacting with the environment. You use it when you don't have a lot of training data; you cannot clearly define the ideal end state; or the only way to learn about the environment is to interact with it.
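The trial-and-error idea behind reinforcement learning can be illustrated with a small sketch. The example below uses an epsilon-greedy bandit, chosen here only for illustration; the arm payoffs, the `epsilon` value, and the function name `run_bandit` are assumptions and do not come from the article.

```python
import random

def run_bandit(payoffs, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error over arms with fixed (hypothetical) payoffs."""
    rng = random.Random(seed)
    estimates = [0.0] * len(payoffs)  # running mean reward per arm
    counts = [0] * len(payoffs)       # number of times each arm was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm to learn about the environment
            arm = rng.randrange(len(payoffs))
        else:
            # Exploit: pick the arm that currently looks best
            arm = max(range(len(payoffs)), key=lambda a: estimates[a])
        reward = payoffs[arm]  # the only feedback comes from interacting
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    best = max(range(len(payoffs)), key=lambda a: estimates[a])
    return best, estimates

best_arm, estimates = run_bandit([0.2, 0.8, 0.5])
print(best_arm, estimates)
```

No labelled training data is supplied here: the agent only discovers which arm pays best by trying arms and observing the rewards.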
r/MachineLearning - [R] Provably Efficient Exploration in Policy Optimization
While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $O(\sqrt{d^3 H^3 T})$ regret. Here d is the feature dimension, H is the episode horizon, and T is the total number of steps.
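One consequence of a regret bound of this form, given here as a short worked step and not as a result quoted from the paper: dividing the regret by the number of steps gives the average regret per step, which vanishes as T grows while d and H are held fixed.

```latex
\frac{\mathrm{Regret}(T)}{T}
  = \frac{O\!\left(\sqrt{d^{3} H^{3} T}\right)}{T}
  = O\!\left(\sqrt{\frac{d^{3} H^{3}}{T}}\right)
  \;\longrightarrow\; 0
  \quad \text{as } T \to \infty .
```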
Managing your Cryptofolio - science2innovation
Portfolio management is the act of making decisions to allocate your funds to a collection of assets for optimal dollar results. When those assets are cryptocurrencies, the question becomes one of allocating funds to digital assets in order to maximise some crypto investment goal, for example, accumulating Bitcoin. In this paper, a reinforcement machine learning approach is built using historical data from the crypto exchange website Poloniex with the goal of optimising investor gains over a set period. This model is then benchmarked against standard portfolio strategies used by traders, such as buy and hold. The results show that the reinforcement learning approach is extremely effective as an investment optimisation strategy, but the authors warn that historical data is not always a valid way to predict the market.
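A buy-and-hold baseline of the kind mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `buy_and_hold_value`, the asset labels, and the price series are all made up for the example and are not taken from the paper or from any exchange data.

```python
def buy_and_hold_value(prices, weights, initial_funds=1.0):
    """Final value when funds are split once by `weights` and never rebalanced.

    prices:  dict mapping asset name -> list of prices over time (hypothetical data)
    weights: dict mapping asset name -> fraction of funds, summing to 1
    """
    total = 0.0
    for asset, series in prices.items():
        # Units bought at the first price and held until the last price
        units = initial_funds * weights[asset] / series[0]
        total += units * series[-1]
    return total

# Hypothetical price histories: one asset rises 50%, the other falls 50%
prices = {"BTC": [100.0, 120.0, 150.0], "ETH": [10.0, 8.0, 5.0]}
weights = {"BTC": 0.5, "ETH": 0.5}
print(buy_and_hold_value(prices, weights))
```

A learned allocation policy would be compared against this number over the same period; as the authors caution, a good score on historical data does not guarantee future performance.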
Reflections on NeurIPS 2019
There is a huge push among the researchers here for accountability. I was presenting a poster on "Objective Mismatch in Model-based Reinforcement Learning" at the Deep RL Workshop, and the crowd was very receptive to the idea that some of our underlying assumptions of how RL works may be flawed. I also happened to be presenting my poster next to a researcher at Google pushing for more metrics of reliability in RL algorithms. That is: when a paper claims a new "state-of-the-art", how consistent is the reported performance across environments and random seeds? This realistic robustness may be the key to making these algorithms more useful in real applications (such as robotics, which I will always bring up as a great interpretable platform for RL).
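As a rough sketch of what checking consistency across seeds can look like, the snippet below summarises per-environment returns by mean, spread, and worst case. The environment names, the scores, and the function name `reliability_summary` are hypothetical, and these three statistics are only one simple choice, not the metrics proposed by the researchers mentioned above.

```python
from statistics import mean, pstdev

def reliability_summary(returns_by_env):
    """Summarise final returns over random seeds for each environment."""
    summary = {}
    for env, returns in returns_by_env.items():
        summary[env] = {
            "mean": mean(returns),    # average performance over seeds
            "std": pstdev(returns),   # spread across seeds
            "worst": min(returns),    # worst seed
        }
    return summary

# Hypothetical final returns, one value per random seed
scores = {
    "env_a": [100.0, 102.0, 98.0],
    "env_b": [150.0, 20.0, 130.0],
}
for env, stats in reliability_summary(scores).items():
    print(env, stats)
```

In this made-up data both environments have the same mean return, but `env_b` has a far larger spread and a much worse worst seed, which a single headline number would hide.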
Self-Play Learning Without a Reward Metric
Schmidt, Dan, Moran, Nick, Rosenfeld, Jonathan S., Rosenthal, Jonathan, Yedidia, Jonathan
The AlphaZero algorithm for the learning of strategy games via self-play, which has produced superhuman ability in the games of Go, chess, and shogi, uses a quantitative reward function for game outcomes, requiring the users of the algorithm to explicitly balance different components of the reward against each other, such as the game winner and margin of victory. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need to perform any quantitative balancing of reward components. We demonstrate that this system learns optimal play in a comparable amount of time to AlphaZero on a sample game.
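The difference between a quantitative reward and a total ordering over outcomes can be sketched as follows. Everything here is illustrative: the `Outcome` fields, the lexicographic `order_key`, and the rank-based `percentile_target` are assumptions made for this example and are not claimed to be the mechanism used in the paper.

```python
from typing import NamedTuple

class Outcome(NamedTuple):
    won: bool
    margin: int  # points ahead (positive) or behind (negative)

def order_key(outcome):
    # Any win ranks above any loss; margin only breaks ties within a result.
    # No numeric weight between "winning" and "margin" has to be chosen.
    return (outcome.won, outcome.margin)

def percentile_target(outcome, reference):
    """Fraction of reference outcomes that `outcome` beats (ties count half)."""
    beats = sum(order_key(outcome) > order_key(r) for r in reference)
    ties = sum(order_key(outcome) == order_key(r) for r in reference)
    return (beats + 0.5 * ties) / len(reference)

# Hypothetical outcomes from earlier games
reference = [Outcome(False, -5), Outcome(False, -1), Outcome(True, 1), Outcome(True, 7)]
print(sorted(reference, key=order_key))
print(percentile_target(Outcome(True, 3), reference))
```

The only thing the user specifies is how to compare two outcomes; the learning signal is then derived from ranks and not from hand-balanced reward components.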
Coordination in Adversarial Sequential Team Games via Multi-Agent Deep Reinforcement Learning
Celli, Andrea, Ciccone, Marco, Bongo, Raffaele, Gatti, Nicola
Many real-world applications involve teams of agents that have to coordinate their actions to reach a common goal against potential adversaries. This paper focuses on zero-sum games where a team of players faces an opponent, as is the case, for example, in Bridge, collusion in poker, and collusion in bidding. The possibility for the team members to communicate before gameplay---that is, coordinate their strategies ex ante---makes the use of behavioral strategies unsatisfactory. We introduce Soft Team Actor-Critic (STAC) as a solution to the team's coordination problem that does not require any prior domain knowledge. STAC allows team members to effectively exploit ex ante communication via exogenous signals that are shared among the team. STAC reaches near-optimal coordinated strategies both in perfectly observable and partially observable games, where previous deep RL algorithms fail to reach optimal coordinated behaviors.
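A toy calculation shows why a shared signal helps where independent randomization does not. This is not STAC itself, only an illustration of ex ante coordination; the two-action matching setup and all function names are assumptions introduced for the example.

```python
from itertools import product

ACTIONS = ("left", "right")

def joint_independent(p1, p2):
    """Joint action distribution when each teammate randomizes on its own."""
    return {(a1, a2): p1[a1] * p2[a2] for a1, a2 in product(ACTIONS, repeat=2)}

def joint_shared_signal(signal_probs, plan1, plan2):
    """Joint distribution when both teammates observe the same signal and
    each follows a deterministic plan (signal -> action) agreed beforehand."""
    joint = {pair: 0.0 for pair in product(ACTIONS, repeat=2)}
    for signal, prob in signal_probs.items():
        joint[(plan1[signal], plan2[signal])] += prob
    return joint

def match_probability(joint):
    """Probability that both teammates pick the same action."""
    return sum(p for (a1, a2), p in joint.items() if a1 == a2)

uniform = {"left": 0.5, "right": 0.5}
plan = {0: "left", 1: "right"}  # hypothetical plan shared by both teammates

independent = joint_independent(uniform, uniform)
coordinated = joint_shared_signal({0: 0.5, 1: 0.5}, plan, plan)
print(match_probability(independent), match_probability(coordinated))
```

In both cases each teammate plays left or right with equal probability, so an opponent cannot predict the action. With the shared signal the teammates always agree, whereas independent randomization agrees only half of the time.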
To Follow or not to Follow: Selective Imitation Learning from Observations
Lee, Youngwoon, Hu, Edward S., Yang, Zhengyu, Lim, Joseph J.
Learning from demonstrations is a useful way to transfer a skill from one agent to another. While most imitation learning methods aim to mimic an expert skill by following the demonstration step-by-step, imitating every step in the demonstration often becomes infeasible when the learner and its environment are different from the demonstration. In this paper, we propose a method that can imitate a demonstration composed solely of observations, which may not be reproducible with the current agent. Our method, dubbed selective imitation learning from observations (SILO), selects reachable states in the demonstration and learns how to reach the selected states. Our experiments on both simulated and real robot environments show that our method reliably performs a new task by following a demonstration. Videos and code are available at https://clvrai.com/silo .
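The idea of following only the reachable parts of a demonstration can be sketched with a simple greedy rule. This is a stand-in for illustration and not the SILO method: the look-ahead `window`, the distance-based `is_reachable` test, and the one-dimensional states are all assumptions.

```python
def select_targets(demo, start, is_reachable, window=3):
    """Greedily pick demonstration states to follow, skipping unreachable ones.

    demo:         list of demonstrated states (observations only)
    start:        the learner's initial state
    is_reachable: function (current_state, target_state) -> bool
    window:       how many upcoming demonstration states to consider
    """
    targets = []
    current = start
    i = 0
    while i < len(demo):
        upcoming = range(i, min(i + window, len(demo)))
        candidates = [j for j in upcoming if is_reachable(current, demo[j])]
        if not candidates:
            break  # nothing within the window can be reached; stop following
        j = candidates[0]  # earliest reachable state in the window
        targets.append(demo[j])
        current = demo[j]
        i = j + 1
    return targets

# Hypothetical 1-D demonstration with one state (10) the learner cannot reach
demo = [1, 2, 10, 3, 4]
near = lambda current, target: abs(target - current) <= 2
print(select_targets(demo, start=0, is_reachable=near))
```

In a learned system the selection rule and the reachability judgement would themselves be learned; here they are hard-coded to keep the example short.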