

Figure 9: In experiments, we used a common feature-extractor F.

Neural Information Processing Systems

Here, we include implementation details omitted from the main paper for brevity. Upon acceptance, a deanonymized repository will be released. The feature extractors and decoders varied by domain, and the last layer's dimension depended on the exact setting. In particular, we found it important to apply this linear transformation rather than passing the raw encodings through directly. For VQ-based methods, use a codebook large enough to have at least one element per class. Other differences simply reflected differences in architecture. For iNat, we trained all models with batch size 256, using the hyperparameters specified in Table 3.
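The codebook-sizing rule above ("at least one element per class") can be sketched as follows. This is a minimal illustration, not the paper's implementation: `codebook_size`, `quantize`, and the base size of 64 are hypothetical names and values, and the nearest-neighbor lookup is the standard VQ assignment step.

```python
import math

def codebook_size(num_classes: int, base: int = 64) -> int:
    """Pick a codebook size with at least one entry per class (hypothetical rule)."""
    return max(base, num_classes)

def quantize(vec, codebook):
    """Return the index of the nearest codebook entry under Euclidean distance."""
    return min(range(len(codebook)), key=lambda i: math.dist(vec, codebook[i]))
```

For example, with 1000 classes (as in iNat-scale settings) the rule forces the codebook to hold at least 1000 entries, whereas for 10 classes the base size dominates.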


Appendix for "Residual Alignment: Uncovering the Mechanisms of Residual Networks"


We start by providing motivation for the unconstrained-Jacobians problem introduced in the main text. We then continue the proof by contradiction.

Figure 1: Fully-connected ResNet34 (Type 1 model) trained on MNIST.
Figure 2: Fully-connected ResNet34 (Type 1 model) trained on FashionMNIST.
Figure 10: Fully-connected ResNet34 (Type 1 model) trained on MNIST.
Figure 24: Fully-connected ResNet34 (Type 1 model) trained on MNIST.






Checklist


[All data used is public.]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?

A.1 Training Details

In our experiments, the classifier f_θ is an 8-layer MLP with 128 hidden dimensions per layer.
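The classifier described above (8 layers, 128 hidden dimensions per layer) can be sketched in plain Python. This is an illustrative reading, not the authors' code: `init_mlp` and `mlp_forward` are hypothetical names, and the He-style initialization and ReLU activations are our assumptions, since the appendix does not state them.

```python
import random

def init_mlp(in_dim, hidden=128, depth=8, out_dim=10, seed=0):
    """Initialize an 8-layer MLP: 7 hidden layers of width 128 plus an output layer."""
    rng = random.Random(seed)
    dims = [in_dim] + [hidden] * (depth - 1) + [out_dim]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        scale = (2.0 / d_in) ** 0.5  # He-style scaling for ReLU units (assumed)
        W = [[rng.gauss(0.0, scale) for _ in range(d_out)] for _ in range(d_in)]
        b = [0.0] * d_out
        layers.append((W, b))
    return layers

def mlp_forward(x, layers):
    """Forward pass with ReLU on hidden layers and raw logits at the output."""
    h = x
    for i, (W, b) in enumerate(layers):
        z = [sum(h[j] * W[j][k] for j in range(len(h))) + b[k] for k in range(len(b))]
        h = z if i == len(layers) - 1 else [max(0.0, v) for v in z]
    return h
```

We count "8-layer" as eight linear layers; if the paper instead meant eight hidden layers, `depth` would be 9.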




The total loss of the model is a combination of both regularization terms and a reconstruction loss. Here x_r refers to the reference image, x_a to the adversarial image, and x̂_r, x̂_a to their corresponding reconstructions. The maximum input noise perturbation level λ is limited to 1, 3, and 5. However, it should also be noted that PGD-based training is twice as computationally expensive as our original method. These attacks are more successful when the adversarial reconstructions are less similar in appearance to the clean reconstructions.
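The combination of reconstruction loss and regularization terms can be sketched as a weighted sum. This is a generic sketch under our own assumptions: the paper's exact terms and weighting are not given in this excerpt, and `total_loss` and the per-term weights are hypothetical.

```python
def total_loss(recon_loss, reg_terms, reg_weights):
    """Combine a reconstruction loss with weighted regularization terms (sketch).

    recon_loss: scalar reconstruction loss, e.g. between x_r and its reconstruction.
    reg_terms: list of scalar regularization terms.
    reg_weights: one assumed weight per regularization term.
    """
    return recon_loss + sum(w * r for w, r in zip(reg_weights, reg_terms))
```

Weighting each regularizer separately lets the trade-off between reconstruction fidelity and robustness be tuned per term.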