Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level