
Neural Information Processing Systems 

Both initialization methods achieve similar performance on downstream tasks, but initializing from BERT-base reduces the number of learning steps. To shorten the training time of our large-size model, we initialize it from BERT-large. We will also release a model trained from scratch. We further trained BERT-large with the same hyper-parameters, but the resulting model did not significantly improve on downstream tasks compared to the original BERT-large.
