c20bb2d9a50d5ac1f713f8b34d9aac5a-AuthorFeedback.pdf

Feb-13-2026, 23:56:25 GMT–Neural Information Processing Systems

Both initialization methods can achieve similar performance on downstream tasks, but initializing from BERT-base reduces the number of learning steps. To shorten the training time of our large-size model, we initialize it from BERT-large. We will also release a model trained from scratch. We further trained BERT-large using the same hyper-parameters, but the resulting model didn't significantly improve on downstream tasks compared to the original BERT-large.

artificial intelligence, machine learning, response to reviewer


Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.39)

model on any particular supervised task). We compared with GPT-2 (345M) on the Winograd Schema Challenge
