Not All Tokens Are What You Need for Pretraining
Neural Information Processing Systems
Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens.
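As a minimal sketch of what a uniformly applied next-token prediction loss means, the snippet below averages the negative log-likelihood over every token with equal weight. The function name `uniform_next_token_loss` and the example log-probabilities are hypothetical illustrations, not taken from the paper.

```python
import math


def uniform_next_token_loss(token_log_probs):
    """Mean negative log-likelihood, weighting every token equally.

    token_log_probs: log-probabilities a model assigned to each correct
    next token in a sequence (hypothetical inputs for illustration).
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    return -sum(token_log_probs) / len(token_log_probs)


# Hypothetical per-token probabilities of the correct next token.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.125)]
loss = uniform_next_token_loss(log_probs)
print(f"uniform next-token loss: {loss:.4f}")
```

Every token contributes the same weight `1 / len(token_log_probs)` to the average, regardless of how informative or noisy that token is.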
Apr-26-2026, 17:18:33 GMT