The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining