A Code

Our codebase can be found in [

Neural Information Processing Systems 

Each batch is made of many randomly sampled trajectory snippets; this is a simple way of performing multi-task training. As we mention in Section 3.1, each trajectory snippet is masked following a scheme common in NLP uses of BERT. An important design choice for transformer models is the dimension over which to apply self-attention. To sidestep this issue, we stack the states, actions, and rewards for each timestep, treating them as a single input.
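The batching scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dataset layout, dimensions, and function names (`sample_batch`, `STATE_DIM`, etc.) are hypothetical, and we assume each snippet is a fixed-length window whose state, action, and reward at each timestep are concatenated into one input vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: a list of trajectories, each a dict of per-timestep
# arrays. Names and shapes are illustrative, not from the paper.
STATE_DIM, ACT_DIM, CONTEXT_LEN, BATCH_SIZE = 4, 2, 8, 16
trajectories = [
    {
        "states": rng.normal(size=(T, STATE_DIM)).astype(np.float32),
        "actions": rng.normal(size=(T, ACT_DIM)).astype(np.float32),
        "rewards": rng.normal(size=(T, 1)).astype(np.float32),
    }
    for T in rng.integers(CONTEXT_LEN, 50, size=100)  # varying lengths >= CONTEXT_LEN
]

def sample_batch(trajs, batch_size, context_len):
    """Sample random length-`context_len` snippets from random trajectories,
    stacking (state, action, reward) at each timestep into a single input."""
    batch = []
    for _ in range(batch_size):
        traj = trajs[rng.integers(len(trajs))]
        T = len(traj["states"])
        start = rng.integers(0, T - context_len + 1)
        sl = slice(start, start + context_len)
        # Concatenate along the feature axis: one fused token per timestep,
        # so self-attention runs over timesteps rather than modalities.
        token = np.concatenate(
            [traj["states"][sl], traj["actions"][sl], traj["rewards"][sl]], axis=-1
        )
        batch.append(token)
    return np.stack(batch)  # (batch_size, context_len, STATE_DIM + ACT_DIM + 1)

batch = sample_batch(trajectories, BATCH_SIZE, CONTEXT_LEN)
print(batch.shape)  # (16, 8, 7)
```

Because snippets are drawn uniformly across all stored trajectories, a single gradient step mixes data from many tasks, which is what makes this a simple form of multi-task training.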