What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
