Recurrent Memory Transformer

Neural Information Processing Systems 

Results of experiments show that RMT performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks that require longer sequence processing. We also show that adding memory tokens to Transformer-XL improves its performance.
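The memory mechanism described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's implementation): it assumes memory tokens are concatenated with each segment's embeddings, processed jointly by the model, and the updated memory positions are carried over to the next segment. The function and variable names, and the use of a single matrix multiply as a stand-in for the Transformer layers, are assumptions for illustration only.

```python
import numpy as np

def process_segment(segment_emb, memory, mix):
    # Prepend memory tokens to the segment embeddings, process them
    # together, then split the result back into updated memory and
    # segment outputs. `mix` is a stand-in for the Transformer layers.
    x = np.concatenate([memory, segment_emb], axis=0)  # [mem + seg, d]
    y = x @ mix
    n_mem = memory.shape[0]
    new_memory = y[:n_mem]      # carried over to the next segment
    output = y[n_mem:]          # per-token outputs for this segment
    return output, new_memory

d = 8                                   # embedding dimension (illustrative)
rng = np.random.default_rng(0)
mix = rng.standard_normal((d, d))
memory = np.zeros((2, d))               # 2 memory tokens, zero-initialized
segments = [rng.standard_normal((4, d)) for _ in range(3)]

outputs = []
for seg in segments:
    out, memory = process_segment(seg, memory, mix)
    outputs.append(out)

print(memory.shape)       # memory shape is preserved across segments
print(len(outputs))       # one output block per segment
```

The recurrence is in the `memory` variable: information from earlier segments can influence later ones only through these carried-over memory tokens, which is what allows processing sequences longer than a single segment.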