A Training and implementation details
Neural Information Processing Systems
We train the baseline ensembles for 200 epochs using SGD with momentum 0.9 and weight decay. The initial learning rate is 0.1, and we decay it by a factor of 10 during training. In our implementation, DVERGE starts from the trained baseline ensembles. Also note that GAL requires the ReLU activation to be replaced with leaky ReLU to avoid vanishing gradients. Ensembles with adversarial training follow the baseline's training setup. We use 0.5 as the input transformation probability for M-DI. As for previous methods, although ADP requires the smallest time budget, it does not improve robustness much, as shown in Figure 4. The number after the slash in the first column is the number of sub-models within the ensemble.
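The learning-rate schedule above can be sketched as a standard step decay. Note that the milestone epochs below are illustrative assumptions, since the text does not specify at which epochs the decay is applied; only the initial rate (0.1) and the decay factor (10) come from the setup described.

```python
def step_lr(epoch, base_lr=0.1, gamma=0.1, milestones=(100, 150)):
    """Step learning-rate decay: multiply base_lr by gamma at each milestone.

    base_lr=0.1 and gamma=0.1 (decay by 10x) follow the training setup;
    milestones=(100, 150) are hypothetical placeholders within 200 epochs.
    """
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

For example, `step_lr(0)` returns 0.1, `step_lr(120)` returns 0.01, and `step_lr(180)` returns 0.001 under the assumed milestones.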
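The input transformation probability for M-DI can be illustrated as a simple stochastic wrapper: with probability p the (attack-specific) transformation is applied to the input, otherwise the input is passed through unchanged. The helper name `maybe_transform` and the identity placeholder transform below are illustrative, not from the original.

```python
import random

def maybe_transform(x, transform, p=0.5):
    """With probability p, apply `transform` to input x; otherwise return x.

    p=0.5 matches the input transformation probability used for M-DI;
    the concrete transform (e.g. random resize-and-pad) is left abstract here.
    """
    return transform(x) if random.random() < p else x
```

Averaged over many calls, the transformation fires on roughly half of the inputs when p=0.5.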