Training Details and Model

Neural Information Processing Systems 

We set the patch size to be 8. Our model is optimized by AdamW optimizer [3] with a learning rate2 of 0.0004, 250k training steps, linearly warm-up of 5000 steps and an exponentially weight-decaying3 schedule. The gradient norm is clipped at 1. We use Pytorch automatic mixed-precision and data4 paralleling for training acceleration. All models are trained on 4 Nvidia RTXA5000 GPUs with a5 total batch size of 128.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found