Temporal-Difference Learning Using Distributed Error Signals

Neural Information Processing Systems

A computational problem in biological reward-based learning is how credit assignment is performed in the nucleus accumbens (NAc). Much research suggests that NAc dopamine encodes temporal-difference (TD) errors for learning value predictions. However, dopamine is synchronously distributed in regionally homogeneous concentrations, which does not support explicit credit assignment (as used by backpropagation). It is unclear whether distributed errors alone are sufficient for synapses to make coordinated updates to learn complex, nonlinear reward-based learning tasks. We design a new deep Q-learning algorithm, Artificial Dopamine, to computationally demonstrate that synchronously distributed, per-layer TD errors may be sufficient to learn surprisingly complex RL tasks. We empirically evaluate our algorithm on MinAtar, the DeepMind Control Suite, and classic control tasks, and show it often achieves performance comparable to deep RL algorithms that use backpropagation.
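The central quantity in this abstract, the TD error, can be sketched in its standard form. This is generic tabular TD(0), not the paper's Artificial Dopamine update rule, whose per-layer details are not given in the abstract:

```python
def td_error(reward, gamma, v_next, v_current):
    """TD(0) error: delta = r + gamma * V(s') - V(s)."""
    return reward + gamma * v_next - v_current

# Tabular TD(0) on a 3-state chain s0 -> s1 -> terminal (reward 1 on the last step).
gamma, alpha = 0.9, 0.5
V = [0.0, 0.0]  # values of s0 and s1; the terminal state has value 0
for _ in range(100):
    V[0] += alpha * td_error(0.0, gamma, V[1], V[0])   # s0 -> s1, r = 0
    V[1] += alpha * td_error(1.0, gamma, 0.0, V[1])    # s1 -> terminal, r = 1
# V converges to [0.9, 1.0]: s1 predicts the reward of 1, s0 predicts its discounted value.
```

The same scalar error drives every update here, which is the intuition behind asking whether a broadcast, rather than pathway-specific, error signal can coordinate learning.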


K-level Reasoning for Zero-Shot Coordination in Hanabi

Neural Information Processing Systems

Work done while at Facebook AI Research. 35th Conference on Neural Information Processing Systems (NeurIPS 2021). Figure 1: Visualization of various hierarchical training schemas, including sequential KLR, synchronous KLR, synchronous CH, and our new SyKLRBR, for 4 levels.


K-level Reasoning for Zero-Shot Coordination in Hanabi

Cui, Brandon, Hu, Hengyuan, Pineda, Luis, Foerster, Jakob N.

arXiv.org Artificial Intelligence

The standard problem setting in cooperative multi-agent settings is self-play (SP), where the goal is to train a team of agents that works well together. However, optimal SP policies commonly contain arbitrary conventions ("handshakes") and are not compatible with other, independently trained agents or humans. This latter desideratum was recently formalized by Hu et al. 2020 as the zero-shot coordination (ZSC) setting and partially addressed with their Other-Play (OP) algorithm, which showed improved ZSC and human-AI performance in the card game Hanabi. OP assumes access to the symmetries of the environment and prevents agents from breaking these in a mutually incompatible way during training. However, as the authors point out, discovering symmetries for a given environment is a computationally hard problem. Instead, we show that through a simple adaptation of k-level reasoning (KLR) (Costa Gomes et al. 2006), synchronously training all levels, we can obtain competitive ZSC and ad-hoc teamplay performance in Hanabi, including when paired with a human-like proxy bot. We also introduce a new method, synchronous k-level reasoning with a best response (SyKLRBR), which further improves on our synchronous KLR by co-training a best response.
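As a toy illustration of the k-level reasoning idea the abstract builds on (not the paper's synchronous Hanabi training pipeline), level 0 can act uniformly at random while each higher level best-responds to the level below. The small coordination matrix game here is my own construction for illustration:

```python
# payoff[i][j]: reward when we play action i and our partner plays j.
# A coordination game where jointly choosing action 2 pays slightly more.
payoff = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.1]]

def best_response(opponent_policy):
    """Deterministic best response to a mixed opponent policy."""
    expected = [sum(payoff[i][j] * opponent_policy[j] for j in range(3))
                for i in range(3)]
    best = max(range(3), key=lambda i: expected[i])
    return [1.0 if i == best else 0.0 for i in range(3)]

policies = [[1/3, 1/3, 1/3]]   # level 0: uniform random
for k in range(1, 4):          # levels 1..3 each best-respond to the level below
    policies.append(best_response(policies[-1]))
# Every level k >= 1 settles on action 2, the highest-payoff coordination point,
# without relying on an arbitrary "handshake" convention.
```

Because level 0 is convention-free, the hierarchy grounds all higher levels in behavior any partner can anticipate, which is the property KLR exploits for zero-shot coordination.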


Smooth Q-learning: Accelerate Convergence of Q-learning Using Similarity

Liao, Wei, Wei, Xiaohui, Lai, Jizhou

arXiv.org Artificial Intelligence

This paper proposes an improvement to Q-learning that differs from the classic algorithm in that it takes the similarity between different states and actions into account. During training, a new updating mechanism synchronously updates the Q values of similar state-action pairs. The proposed method can be combined with both tabular Q-learning and deep Q-learning, and numerical examples illustrate that it performs significantly better than classic Q-learning.
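A minimal sketch of the synchronous, similarity-weighted update the abstract describes, assuming a simple Gaussian kernel over state-action indices; the paper's actual similarity measure is not specified in the abstract:

```python
import math

def similarity(p, q, scale=1.0):
    # Illustrative Gaussian kernel over (state, action) index pairs;
    # an assumption for this sketch, not the paper's definition.
    d = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-d / scale)

def smooth_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one transition's target to every pair, weighted by similarity to (s, a)."""
    target = r + gamma * max(Q[s_next])
    for sp in range(len(Q)):
        for ap in range(len(Q[0])):
            w = similarity((s, a), (sp, ap))
            Q[sp][ap] += alpha * w * (target - Q[sp][ap])

Q = [[0.0, 0.0] for _ in range(3)]      # 3 states, 2 actions
smooth_q_update(Q, s=0, a=0, r=1.0, s_next=1)
# Q[0][0] receives the full update (weight 1); similar pairs receive smaller,
# synchronous updates, so one transition informs many entries at once.
```

With the kernel's scale set near zero, the weights on all non-visited pairs vanish and the rule reduces to classic Q-learning, which is why the method can be layered on top of either tabular or deep variants.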


Teaching HAII

#artificialintelligence

Human-AI Interaction (HAII) was taught twice in a fully remote fashion over 12 weeks at Williams College to undergraduates only. The course was organized around 11.5 modules (i.e., topics), and each module included: 2x pre-recorded lecture videos, 2-3 readings (1 research paper, 2 popular media), a 7-question quiz on that module's materials, and a 60-minute synchronous [remote] class meeting with 8 students, followed by discussion forum posts and 2x responses to peers. The course was implemented in Canvas, referred to as "GLOW" at Williams, but I have made available some versions of the materials via Google documents on this website (apologies for any formatting issues!). Context for each course component is below, but details can be found on the Schedule and Syllabus pages. Lectures: Pre-recorded lectures for a module were posted twice per week (Thursdays & Mondays); they were recorded in 15-minute sections and linked together in a YouTube playlist totaling no more than 50 minutes, though usually around 30-40 minutes.