When BERT meets PyTorch

Sep-1-2019, 03:31:35 GMT–#artificialintelligence

We keep the BERT encoder unfrozen so that all of its weights are updated on every iteration. Given the number of trainable parameters, it is useful to train the model on multiple GPUs in parallel. I used 4 Tesla K80s for about 4500 training samples. Just remember that once the model is wrapped for parallel training, any model attribute has to be accessed as modelName.module.attribute. I used Stochastic Gradient Descent with momentum as the optimizer and found that cycling both the learning rate and the momentum really helped to get the training and validation losses down.
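As a rough illustration of what cycling both the learning rate and the momentum can look like, here is a minimal pure-Python sketch of a triangular schedule. The summary does not give the schedule's shape or values, so the function names, the half-cycle length, the bounds, and the choice to move momentum in the opposite direction to the learning rate are all assumptions made for illustration.

```python
def triangular(step, half_cycle, start, peak):
    """Triangular wave: `start` at step 0, `peak` at `half_cycle`,
    back to `start` at 2 * half_cycle, then repeating."""
    cycle_pos = step % (2 * half_cycle)
    if cycle_pos <= half_cycle:
        frac = cycle_pos / half_cycle          # rising half of the cycle
    else:
        frac = 2 - cycle_pos / half_cycle      # falling half of the cycle
    return start + (peak - start) * frac


def cyclic_lr_and_momentum(step, half_cycle=4,
                           base_lr=0.001, max_lr=0.01,
                           base_momentum=0.85, max_momentum=0.95):
    """Return (learning rate, momentum) for a training step.

    All default values are hypothetical. Momentum is assumed to run
    inversely to the learning rate: highest when the rate is lowest.
    """
    lr = triangular(step, half_cycle, base_lr, max_lr)
    momentum = triangular(step, half_cycle, max_momentum, base_momentum)
    return lr, momentum


if __name__ == "__main__":
    for step in range(9):
        lr, momentum = cyclic_lr_and_momentum(step)
        print(f"step {step}: lr={lr:.4f} momentum={momentum:.3f}")
```

In PyTorch itself, a comparable effect can be had by attaching torch.optim.lr_scheduler.CyclicLR to an SGD optimizer with cycle_momentum=True, which cycles momentum between base_momentum and max_momentum inversely to the learning rate.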
