A Training and implementation details

Neural Information Processing Systems 

We train the baseline ensembles for 200 epochs using SGD with momentum 0.9 and weight decay. The initial learning rate is 0.1, and we decay it by a factor of 10 during training. In our implementation, DVERGE starts from the trained baseline ensembles. Note that GAL requires the ReLU activation to be replaced with leaky ReLU to avoid vanishing gradients. Ensembles with adversarial training follow the baseline's training setup. We use 0.5 as the input transformation probability for M-DI. Among previous methods, ADP requires the smallest time budget, but it does not improve robustness much, as shown in Figure 4. In the first column, the number after the slash is the number of sub-models within the ensemble.
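The step learning-rate schedule described above can be sketched as follows. This is a minimal illustration, not the paper's code: only the initial rate (0.1), the 10x decay factor, and the 200-epoch budget come from the text; the decay milestones (epochs 100 and 150) are assumptions chosen for illustration.

```python
def step_lr(epoch, initial_lr=0.1, decay_factor=10.0, milestones=(100, 150)):
    """Return the SGD learning rate at a given epoch under step decay.

    NOTE: the milestones are hypothetical; the source text does not
    specify at which epochs the decay happens.
    """
    lr = initial_lr
    for m in milestones:
        if epoch >= m:
            lr /= decay_factor
    return lr

# Example: with the assumed milestones, the rate over 200 epochs would be
# 0.1 for epochs 0-99, 0.01 for 100-149, and 0.001 for 150-199.
```

In a PyTorch-based setup, the same schedule is typically expressed with `torch.optim.lr_scheduler.MultiStepLR` attached to an SGD optimizer.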
