van Seijen
Review for NeurIPS paper: Munchausen Reinforcement Learning
Additional Feedback: After Authors' Response: I still find the paper's analysis regarding action gaps a bit weak, and the authors' response didn't help much in that regard. I think their action-gap analysis needs to be reconsidered under the new findings of (van Seijen et al., 2019); increasing the action gap is not important on its own, rather it is the homogeneity of the action gaps across states that is important. While I still stand by my verdict of accepting this paper, in light of the other reviews, I think the paper's writing should be toned down a bit regarding its theoretical novelty and its claims about empirical results (e.g. being the first non-dist-RL agent to beat a dist-RL one). Q1: To the best of my knowledge, IQN in Dopamine also uses Double Q-learning. Is this also the case for your M-IQN agent?
van Seijen
This paper introduces a novel approach for abstraction selection in reinforcement learning problems modelled as factored Markov decision processes (MDPs), in which a state is described via a set of state components. In abstraction selection, an agent must choose an abstraction from a set of candidate abstractions, each built from a different combination of state components.
Supplementary material for Uncorrected least-squares temporal difference with lambda-return
November 15, 2019

Abstract

Here, we provide supplementary material for Takayuki Osogami, "Uncorrected least-squares temporal difference with lambda-return," which appears in Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20) [Osogami, 2019].

A Proofs

In this section, we prove Theorem 1, Lemma 1, Theorem 2, Lemma 2, and Proposition 1. Note that equations (1)-(19) refer to those in Osogami [2019].

A.1 Proof of Theorem 1

From (7)-(8), we have the following chain of equalities:

\[
\begin{aligned}
A^{\mathrm{Unc}}_T &= \sum_{t=0}^{T} \phi_t \Big(\phi_t - (1-\lambda)\gamma \sum_{m=1}^{T-t} (\lambda\gamma)^{m-1} \phi_{t+m}\Big)^{\top} && (20) \\
&= \sum_{t=0}^{T-1} \phi_t \Big(\phi_t - (1-\lambda)\gamma \sum_{m=1}^{T-t} (\lambda\gamma)^{m-1} \phi_{t+m}\Big)^{\top} + \phi_T \phi_T^{\top} && (21) \\
&= \sum_{t=0}^{T-1} \phi_t \Big(\phi_t - (1-\lambda)\gamma \sum_{m=1}^{T-t-1} (\lambda\gamma)^{m-1} \phi_{t+m} - (1-\lambda)\gamma (\lambda\gamma)^{T-t-1} \phi_T\Big)^{\top} + \phi_T \phi_T^{\top} && (22) \\
&= A^{\mathrm{Unc}}_{T-1} - (1-\lambda)\gamma \sum_{t=0}^{T-1} (\lambda\gamma)^{T-t-1} \phi_t \phi_T^{\top} + \phi_T \phi_T^{\top} && (23) \\
&= A^{\mathrm{Unc}}_{T-1} + \Big(\sum_{t=0}^{T} (\lambda\gamma)^{T-t} \phi_t\Big) \phi_T^{\top} - \gamma \Big(\sum_{t=0}^{T-1} (\lambda\gamma)^{T-t-1} \phi_t\Big) \phi_T^{\top} && (24) \\
&= A^{\mathrm{Unc}}_{T-1} + (z_T - \gamma z_{T-1})\, \phi_T^{\top},
\end{aligned}
\]

where $z_T = \sum_{t=0}^{T} (\lambda\gamma)^{T-t} \phi_t$. Here, (21) separates out the $t=T$ term, whose inner sum is empty; (22) separates out the $m=T-t$ term of the inner sum; (23) identifies the remaining double sum as $A^{\mathrm{Unc}}_{T-1}$; and (24) regroups the leftover terms by writing $(1-\lambda)\gamma = \gamma - \lambda\gamma$ and absorbing $\phi_T \phi_T^{\top}$ into the first sum. The recursive computation of the eligibility trace, $z_T = \lambda\gamma\, z_{T-1} + \phi_T$, can be verified in a straightforward manner.
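The recursion proved above, $A^{\mathrm{Unc}}_T = A^{\mathrm{Unc}}_{T-1} + (z_T - \gamma z_{T-1})\phi_T^{\top}$ with trace $z_T = \lambda\gamma z_{T-1} + \phi_T$, is easy to check numerically. The following NumPy sketch (ours, not the paper's code) evaluates $A^{\mathrm{Unc}}_T$ directly from the definition in (20) and via the trace recursion, and confirms the two agree:

```python
import numpy as np

# Numerical check of Theorem 1 (illustrative sketch, not the paper's code).
rng = np.random.default_rng(0)
gamma, lam, d, T = 0.9, 0.7, 3, 6
phi = rng.standard_normal((T + 1, d))  # feature vectors phi_0, ..., phi_T

def A_direct(T):
    # Direct evaluation of equation (20).
    A = np.zeros((d, d))
    for t in range(T + 1):
        target = phi[t].copy()
        for m in range(1, T - t + 1):
            target -= (1 - lam) * gamma * (lam * gamma) ** (m - 1) * phi[t + m]
        A += np.outer(phi[t], target)
    return A

# Recursive evaluation: A_T = A_{T-1} + (z_T - gamma * z_{T-1}) phi_T^T.
A_rec = np.outer(phi[0], phi[0])   # A_0 = phi_0 phi_0^T
z_prev = phi[0].copy()             # z_0 = phi_0
for t in range(1, T + 1):
    z = lam * gamma * z_prev + phi[t]              # trace recursion
    A_rec += np.outer(z - gamma * z_prev, phi[t])  # rank-one update
    z_prev = z

print(np.allclose(A_direct(T), A_rec))
```

The rank-one recursive form is what makes the trace-based computation cheap: each step costs $O(d^2)$ instead of re-summing the full history.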
Efficient Model-Based Deep Reinforcement Learning with Variational State Tabulation
Corneil, Dane, Gerstner, Wulfram, Brea, Johanni
Modern reinforcement learning algorithms reach super-human performance in many board and video games, but they are sample inefficient, i.e. they typically require significantly more playing experience than humans to reach an equal performance level. To improve sample efficiency, an agent may build a model of the environment and use planning methods to update its policy. In this article we introduce VaST (Variational State Tabulation), which maps an environment with a high-dimensional state space (e.g. the space of visual inputs) to an abstract tabular environment. Prioritized sweeping with small backups, a highly efficient planning method, can then be used to update state-action values. We show how VaST can rapidly learn to maximize reward in tasks like 3D navigation and efficiently adapt to sudden changes in rewards or transition probabilities.
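As a rough illustration of the planning step the abstract refers to, here is a minimal tabular prioritized-sweeping sketch. This is the classic deterministic-model form, not VaST's "small backup" variant, and all names are ours: after each value change, the predecessors of the affected state are re-queued by how much their values could change.

```python
import heapq
from collections import defaultdict

def prioritized_sweeping(transitions, gamma=0.95, theta=1e-4, max_updates=1000):
    """transitions: (s, a, r, s_next) tuples defining a deterministic model."""
    model, preds = {}, defaultdict(set)
    actions, Q = defaultdict(set), defaultdict(float)
    for s, a, r, s2 in transitions:
        model[(s, a)] = (r, s2)
        preds[s2].add((s, a))   # reverse model: who leads into s2
        actions[s].add(a)

    def backup(s, a):
        r, s2 = model[(s, a)]
        return r + gamma * max((Q[(s2, b)] for b in actions[s2]), default=0.0)

    pq = []  # max-heap on priority, via negated values
    for (s, a) in model:
        p = abs(backup(s, a) - Q[(s, a)])
        if p > theta:
            heapq.heappush(pq, (-p, (s, a)))

    for _ in range(max_updates):
        if not pq:
            break
        _, (s, a) = heapq.heappop(pq)
        Q[(s, a)] = backup(s, a)
        for ps, pa in preds[s]:  # re-prioritize predecessors of s
            p = abs(backup(ps, pa) - Q[(ps, pa)])
            if p > theta:
                heapq.heappush(pq, (-p, (ps, pa)))
    return Q

# Tiny 3-state chain: 0 -r-> 1 -r-> 2 -stay-> 2, reward 1 on the 1 -> 2 step.
Q = prioritized_sweeping([(0, "r", 0.0, 1), (1, "r", 1.0, 2), (2, "stay", 0.0, 2)])
```

On the chain above, the reward propagates backwards in two backups: Q(1, r) becomes 1.0, then its predecessor Q(0, r) becomes gamma * 1.0 = 0.95.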
A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning
Yang, Long, Shi, Minhao, Zheng, Qian, Meng, Wenjia, Pan, Gang
Recently, a new multi-step temporal-difference learning algorithm, called $Q(\sigma)$, unified $n$-step Tree-Backup (when $\sigma=0$) and $n$-step Sarsa (when $\sigma=1$) by introducing a sampling parameter $\sigma$. However, like other multi-step temporal-difference learning algorithms, $Q(\sigma)$ requires substantial memory and computation time. The eligibility trace is an important mechanism for transforming off-line updates into efficient on-line ones that consume less memory and computation time. In this paper, we develop the original $Q(\sigma)$ further, combine it with eligibility traces, and propose a new algorithm, called $Q(\sigma,\lambda)$, in which $\lambda$ is the trace-decay parameter. This idea unifies Sarsa$(\lambda)$ (when $\sigma=1$) and $Q^{\pi}(\lambda)$ (when $\sigma=0$). Furthermore, we give an upper error bound for the $Q(\sigma,\lambda)$ policy-evaluation algorithm. We prove that the $Q(\sigma,\lambda)$ control algorithm converges to the optimal value function exponentially fast. We also empirically compare it with conventional temporal-difference learning methods. Results show that, with an intermediate value of $\sigma$, $Q(\sigma,\lambda)$ creates a mixture of the existing algorithms that can learn the optimal value significantly faster than either extreme ($\sigma=0$ or $\sigma=1$).
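The interpolation at the heart of $Q(\sigma)$ can be illustrated with its one-step backup target (a sketch under our own naming, not the authors' code): $\sigma$ blends a sampled bootstrap, as in Sarsa, with an expected bootstrap under the policy, as in Tree-Backup.

```python
import numpy as np

def q_sigma_target(r, q_next, pi_next, a_next, sigma, gamma=1.0):
    """One-step Q(sigma) backup target: sigma=1 recovers a Sarsa-style
    sampled bootstrap on the taken action; sigma=0 recovers a
    Tree-Backup-style expected bootstrap under the policy."""
    sample = q_next[a_next]                        # sampled bootstrap
    expectation = float(np.dot(pi_next, q_next))   # expected bootstrap
    return r + gamma * (sigma * sample + (1 - sigma) * expectation)

# With q(s', .) = [1, 3], a uniform policy, and next action a' = 1:
q_next, pi_next = np.array([1.0, 3.0]), np.array([0.5, 0.5])
t_sarsa = q_sigma_target(0.0, q_next, pi_next, 1, sigma=1.0)  # 3.0
t_tree = q_sigma_target(0.0, q_next, pi_next, 1, sigma=0.0)   # 2.0
```

The multi-step and eligibility-trace versions the abstract describes chain this interpolation across time steps with the decay parameter $\lambda$.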
Multi-Advisor Reinforcement Learning
Laroche, Romain, Fatemi, Mehdi, Romoff, Joshua, van Seijen, Harm
We consider tackling a single-agent RL problem by distributing it to $n$ learners. These learners, called advisors, each endeavour to solve the problem from a different focus. Their advice, taking the form of action values, is then communicated to an aggregator, which is in control of the system. We show that the local planning method for the advisors is critical and that none of the ones found in the literature is flawless: the egocentric planning overestimates values of states where the other advisors disagree, and the agnostic planning is inefficient around danger zones. We introduce a novel approach called empathic and discuss its theoretical aspects. We empirically examine and validate our theoretical findings on a fruit collection task.
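The aggregation step can be pictured with a small sketch (ours, and hedged: summing advisors' action values and acting greedily is one natural aggregator; how each advisor computes its values — egocentric, agnostic, or empathic planning — is the paper's actual subject of study).

```python
import numpy as np

def aggregate_and_act(advisor_q_values):
    """advisor_q_values: list of per-advisor action-value vectors for the
    current state. The aggregator sums the advice and acts greedily.
    (Illustrative sketch, not the paper's implementation.)"""
    total = np.sum(advisor_q_values, axis=0)
    return int(np.argmax(total)), total

# Two advisors over three actions: each cares about a different sub-goal.
action, total = aggregate_and_act([np.array([1.0, 0.0, 0.0]),
                                   np.array([0.0, 0.5, 2.0])])
```

Here the second advisor's strong preference for the last action dominates the sum, so the aggregator selects action 2.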
It can't write this story yet, but Microsoft has trained AI to win Ms. Pac-Man
In the latest sign of artificial intelligence (AI)'s eventual dominance of the workplace, a Canadian deep learning startup-turned-division of Microsoft Corp. has successfully created an AI-based system that achieved the maximum possible score on Ms. Pac-Man.

That might not sound like the most complicated task in the world – especially since the edition in question was the Atari 2600 version and not the arcade original – but as Microsoft senior writer Allison Linn explains in a recent blog post, the challenge facing researchers at Montreal-based Maluuba was more daunting than you might think.

"A lot of companies working on AI use games to build intelligent algorithms because there's a lot of human-like intelligence capabilities that you need to beat the games," Maluuba program manager Rahul Mehrotra explains in the story, noting that the variety of situations you can encounter while playing the games makes them a good testing ground. In other words, the techniques used to develop the AI-driven Ms. Pac-Man master (or is that mistress?) could prove useful well beyond gaming.

Like many of its ilk, Ms. Pac-Man was intentionally designed to be easy to learn yet nearly impossible to master so that players would keep dropping in quarters, with co-creator Steve Golson noting that Ms. Pac-Man in particular was programmed to be more random than the original Pac-Man, so it would be harder for players to finish.
AI computer gets first ever perfect score on Ms. Pac-Man
While it might sound like an elusive dream for most, the perfect score for arcade classic Ms. Pac-Man has been achieved – albeit by a computer. Researchers have created an artificial intelligence-based system that learned how to get the maximum score of 999,990 on the addictive 1980s video game. And the innovative method used could help to make advances in other areas of AI research, such as natural language processing.

The technique, which the team has named 'Hybrid Reward Architecture', used 150 agents, which worked in parallel with one another. For example, some agents were rewarded for successfully finding one specific pellet, while others were tasked with staying out of the way of ghosts.