c20bb2d9a50d5ac1f713f8b34d9aac5a-AuthorFeedback.pdf
–Neural Information Processing Systems
Both initialization methods can achieve similar performance on downstream tasks, but initializing from BERT-base reduces the number of learning steps. In order to shorten the training time of our large-size model, we initialize it from BERT-large. We will also release a model trained from scratch. We further trained BERT-large using the same hyper-parameters, but the resulting model didn't significantly improve performance on downstream tasks compared to the original BERT-large.
Feb-13-2026, 23:56:25 GMT