Media
MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining Jacob Portes
Although BERT -style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the past half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT.