Not All Tokens Are What You Need for Pretraining

Neural Information Processing Systems 

Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens.
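
As a minimal sketch of what "uniformly applied" means here, the snippet below averages the negative log-likelihood over every position, giving each token the same weight. The function name `uniform_next_token_loss` and the probability values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def uniform_next_token_loss(token_probs):
    """Mean negative log-likelihood over all positions, equally weighted.

    token_probs[i] is an assumed probability that a model assigned to the
    correct next token at position i (illustrative input, not real data).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Every token contributes to the loss with the same weight 1/N.
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Illustrative values only.
probs = [0.5, 0.25, 0.125]
loss = uniform_next_token_loss(probs)
print(f"uniform loss: {loss:.4f}")
```

In this formulation no token is skipped or down-weighted: a token the model already predicts well and a token it predicts poorly both enter the average with weight 1/N.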