 exploration game



Review for NeurIPS paper: Non-Crossing Quantile Regression for Distributional Reinforcement Learning

Neural Information Processing Systems

Weaknesses: - Baseline algorithm: While all quantile-based distributional RL algorithms suffer from the crossing quantile issue, QR-DQN is the least affected one since its quantiles are uniformly fixed. IQN[1], which uses randomly sampled quantiles, and FQF[2], which optimizes over chosen quantiles for better distribution approximation, are both expected to suffer much more from crossing quantiles than QR-DQN. While it may be non-trivial to adapt the NC architecture to IQN since the quantiles are randomly sampled, it shouldn't be hard to adapt it to FQF. Besides, IQN and FQF have both achieved much higher scores than QR-DQN, hence I believe implementing the NC architecture on IQN and FQF would greatly strengthen the empirical validation. Can the authors explain why only 49 out of 57 games are used for evaluation? - Number of quantiles: I believe that N 100 quantiles is a reasonable choice.
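
To make the crossing quantile issue concrete, here is a minimal sketch, not the reviewed paper's implementation. It assumes one common way of enforcing monotonicity: a softmax turns unconstrained outputs into positive increments, and their running sum is rescaled to a value range. The function names and the `low`/`high` bounds are hypothetical, chosen only for illustration.

```python
import math


def count_crossings(quantiles):
    """Count adjacent pairs where a higher quantile level has a lower value."""
    return sum(1 for a, b in zip(quantiles, quantiles[1:]) if b < a)


def non_crossing_quantiles(logits, low, high):
    """Map unconstrained logits to increasing quantile values in (low, high].

    Assumption for illustration: softmax yields positive increments that
    sum to 1, so their running sum is monotone and the scaled result
    cannot cross.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    out, running = [], 0.0
    for e in exps:
        running += e / total
        out.append(low + (high - low) * running)
    return out


# Unconstrained estimates can cross; the constrained ones cannot.
raw = [0.3, -0.1, 0.5, 0.2]
constrained = non_crossing_quantiles(raw, low=-1.0, high=1.0)
print(count_crossings(raw), count_crossings(constrained))
```

With fixed, evenly spaced quantile levels as in QR-DQN, only the ordering of values matters, which is why a monotone output layer of this kind is enough in that setting.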


Ensuring AI works with the right dose of curiosity

#artificialintelligence

Friday night has rolled around, and you're trying to pick a restaurant for dinner. Should you visit your most beloved watering hole or try a new establishment, in the hopes of discovering something superior? Potentially, but that curiosity comes with a risk: If you explore the new option, the food could be worse. On the flip side, if you stick with what you know works well, you won't grow out of your narrow pathway. Curiosity drives artificial intelligence to explore the world, now in boundless use cases -- autonomous navigation, robotic decision-making, optimizing health outcomes, and more.


The Joy of Walking in Games

WIRED

When the world locked down, I chose to walk. In a world that was slowly closing in, wandering the vast landscapes of walking simulator games felt like a release. I immersed myself in the lives of others: people who were on journeys of their own, my outer and inner worlds blending into one. I haven't been alone, either. People turned to video games in droves during the pandemic, and game companies posted record profits.


Temporal Difference Uncertainties as a Signal for Exploration

arXiv.org Artificial Intelligence

An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve function approximators, obtaining accurate uncertainty estimates is almost as challenging a problem. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agent's parameters. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agent's temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a "curriculum" that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard-exploration tasks, including Deep Sea and Atari 2600 environments and find that our proposed form of exploration facilitates both diverse and deep exploration.
Striking the right balance between exploration and exploitation is fundamental to the reinforcement learning problem. A common approach is to derive exploration from the policy being learned. Dithering strategies, such as ɛ-greedy exploration, render a reward-maximising policy stochastic around its reward-maximising behaviour (Williams & Peng, 1991). Other methods encourage higher entropy in the policy (Ziebart et al., 2008), introduce an intrinsic reward (Singh et al., 2005), or drive exploration by sampling from the agent's belief over the MDP (Strens, 2000). While greedy or entropy-maximising policies cannot facilitate temporally extended exploration (Osband et al., 2013; 2016a), the efficacy of intrinsic rewards depends crucially on how they relate to the extrinsic reward that comes from the environment (Burda et al., 2018a).
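
The idea of an exploration signal built from a distribution over temporal difference errors can be sketched as follows. This is an illustrative reading of the abstract, not the authors' method: it assumes a small ensemble of tabular Q-value estimates (`q_heads`, a hypothetical name) stands in for the agent's parameter uncertainty, and it uses the spread of their TD errors on one fixed transition as the intrinsic reward.

```python
import statistics


def td_errors(q_heads, s, a, r, s_next, gamma=0.99):
    """One TD error per ensemble member, all evaluated on the same transition.

    Each q in q_heads is a table indexed as q[state][action].
    """
    return [r + gamma * max(q[s_next]) - q[s][a] for q in q_heads]


def intrinsic_reward(q_heads, s, a, r, s_next, gamma=0.99):
    """Spread of TD errors across ensemble members (population std. dev.).

    Assumption: disagreement between members approximates uncertainty in
    value that is due to uncertainty over the agent's parameters.
    """
    return statistics.pstdev(td_errors(q_heads, s, a, r, s_next, gamma))


# Two hypothetical value tables with 2 states and 2 actions each.
q1 = [[0.0, 1.0], [0.0, 0.0]]
q2 = [[0.0, 3.0], [0.0, 0.0]]

print(intrinsic_reward([q1, q2], s=0, a=1, r=1.0, s_next=1))  # members disagree
print(intrinsic_reward([q1, q1], s=0, a=1, r=1.0, s_next=1))  # members agree
```

Because every member is evaluated on the same transition, the signal reflects disagreement between value estimates rather than randomness in the environment, and it drops to zero once the members agree, which matches the abstract's remark that the curriculum vanishes in the limit of perfect value estimates.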


Playing hard exploration games by watching YouTube

Neural Information Processing Systems

Deep reinforcement learning methods traditionally struggle with tasks where environment rewards are particularly sparse. One successful method of guiding exploration in these domains is to imitate trajectories provided by a human demonstrator. However, these demonstrations are typically collected under artificial conditions, i.e. with access to the agent’s exact environment setup and the demonstrator’s action and reward trajectories. Here we propose a method that overcomes these limitations in two stages. First, we learn to map unaligned videos from multiple sources to a common representation using self-supervised objectives constructed over both time and modality (i.e. vision and sound). Second, we embed a single YouTube video in this representation to learn a reward function that encourages an agent to imitate human gameplay. This method of one-shot imitation allows our agent to convincingly exceed human-level performance on the infamously hard exploration games Montezuma’s Revenge, Pitfall! and Private Eye for the first time, even if the agent is not presented with any environment rewards.


The Most Promising Indie Games That Showed Up at E3, From 'Sable' to 'NeoCab'

WIRED

E3 is widely considered a conference for big games, and understandably so; the largest publishers in the industry dominate the event, debuting trailers and news for the most expensive and expansive videogames they could possibly produce. But it's not impossible to find compelling independent games at the show, either: here are our picks for five that you'll want to keep your eyes on in the months to come. NeoCab is a game about the emotional labor of the gig economy, in a moody cyberpunk futurescape. You play one of the last human cab drivers, competing against an army of automated cars. The narrative forces you to balance the emotional health of your protagonist with the brutal needs of the job, as you struggle to barely--just barely--eke out a living.


Fair Information Sharing for Treasure Hunting

AAAI Conferences

In a search task, a group of agents compete to be the first to find the solution. Each agent has different private information to incorporate into its search. This problem is inspired by settings such as scientific research, Bitcoin hash inversion, or hunting for some buried treasure. A social planner such as a funding agency, mining pool, or pirate captain might like to convince the agents to collaborate, share their information, and greatly reduce the cost of searching. However, this cooperation is in tension with the individuals' competitive desire to each be the first to win the search. The planner's proposal should incentivize truthful information sharing, reduce the total cost of searching, and satisfy fairness properties that preserve the spirit of the competition. We design contract-based mechanisms for information sharing without money. The planner solicits the agents' information and assigns search locations to the agents, who may then search only within their assignments. Truthful reporting of information to the mechanism maximizes an agent's chance to win the search. Epsilon-voluntary participation is satisfied for large search spaces. In order to formalize the planner's goals of fairness and reduced search cost, we propose a simplified, simulated game as a benchmark and quantify fairness and search cost relative to this benchmark scenario. The game is also used to implement our mechanisms. Finally, we extend to the case where coalitions of agents may participate in the mechanism, forming larger coalitions recursively.