Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging