A training

Neural Information Processing Systems 

Table 4 describes the hyperparameters for pre-training the baseline and PLD.

Eqn. 5 indicates that the gradient

Figure 1 shows the full comparison of the baseline and PLD, fine-tuned at different checkpoints. Specifically, the fine-tuning results are often much worse with a large learning rate.

Figure 11: The fine-tuning results at different checkpoints.

Figure 12: Convergence curves varying the keep ratio θ.