Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level
