Slimmed Asymmetrical Contrastive Learning and Cross Distillation for Lightweight Model Training: Supplementary Material

Neural Information Processing Systems

In Section 3.2, we proposed the cross-distillation (XD) learning scheme. The distillation objective in Eq. (10) is the inner decorrelation minimization between the embeddings z and [z]. In addition to the correlation-based distillation loss, we also investigate a distillation loss based on the negative logarithm. To avoid an unbalanced loss magnitude, the distillation loss is introduced as a regularization term controlled by the penalty level γ:

$$\mathcal{L} = \mathcal{L}_{\mathrm{SACL}}(z^A, z^B) + \gamma \mathcal{L}_{\mathrm{CD}} \tag{1}$$

$$\mathcal{L}_{\mathrm{CD}} = \left( [z^A] \log z^A + [z^B] \log z^B \right) / 2 \tag{2}$$

We empirically observe that the negative-logarithm-based distillation loss fails to outperform the proposed cross-distillation loss $\mathcal{L}_{\mathrm{CD}}$ with inner-decorrelation minimization, as shown in the ImageNet-100 results below.

Method | Encoder | # of Params (M) | Linear Eval Acc.
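
The regularized objective in Eqs. (1) and (2) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation. It assumes that the bracketed embedding [z] acts as a fixed target, that both embeddings are turned into probability vectors with a softmax so the logarithm is defined, and that the loss carries the leading minus sign implied by the phrase "negative logarithm" (Eq. (2) as printed shows no sign). The function names are hypothetical.

```python
import math


def softmax(v):
    """Map raw embedding values to a probability vector (assumption, see lead-in)."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def neg_log_term(target, pred):
    """-sum_i target_i * log(pred_i); the minus sign is an assumption."""
    return -sum(t * math.log(p) for t, p in zip(target, pred))


def distill_loss(target_a, z_a, target_b, z_b):
    """Average of the two branch terms, mirroring the /2 in Eq. (2)."""
    return (neg_log_term(target_a, z_a) + neg_log_term(target_b, z_b)) / 2


def total_loss(l_sacl, l_cd, gamma):
    """Eq. (1): contrastive loss plus gamma times the distillation term."""
    return l_sacl + gamma * l_cd


# Toy usage: hypothetical 3-dimensional embeddings for the two branches.
z_a = softmax([1.0, 2.0, 3.0])
z_b = softmax([0.5, 0.5, 2.0])
l_cd = distill_loss(z_b, z_a, z_a, z_b)
loss = total_loss(1.0, l_cd, gamma=0.1)
```

Here γ only rescales the distillation term, so a small γ keeps it from dominating the contrastive loss.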


Convolutional Normalization: Improving Deep Convolutional Network Robustness and Training

Neural Information Processing Systems

Normalization techniques have become a basic component in modern convolutional neural networks (ConvNets). In particular, many recent works demonstrate that promoting the orthogonality of the weights helps train deep models and improve robustness. For ConvNets, most existing methods are based on penalizing or normalizing weight matrices derived from concatenating or flattening the convolutional kernels. These methods often destroy or ignore the benign convolutional structure of the kernels; therefore, they are often expensive or impractical for deep ConvNets. In contrast, we introduce a simple and efficient "Convolutional Normalization" (ConvNorm) method that can fully exploit the convolutional structure in the Fourier domain and serve as a simple plug-and-play module that can be conveniently incorporated into any ConvNet. Our method is inspired by recent work on preconditioning methods for convolutional sparse coding and can effectively promote each layer's channel-wise isometry. Furthermore, we show that ConvNorm can reduce the layerwise spectral norm of the weight matrices and hence improve the Lipschitzness of the network, leading to easier training and improved robustness for deep ConvNets. Applied to classification under noise corruptions and to generative adversarial networks (GANs), ConvNorm improves the robustness of common ConvNets such as ResNet and the performance of GANs. We verify our findings via numerical experiments on CIFAR and ImageNet.
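
To make the Fourier-domain idea above concrete, here is a minimal sketch for the simplest possible case: a single-channel, 1-D circular convolution. It is not the authors' ConvNorm, which targets multi-channel layers. It only illustrates the standard fact that the singular values of a circular convolution are the magnitudes of the kernel's DFT, so rescaling each frequency to unit magnitude yields an isometry with spectral norm 1. All function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for a demo)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft above."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def pad(kernel, n):
    """Zero-pad the kernel to the signal length."""
    return list(kernel) + [0.0] * (n - len(kernel))


def spectral_norm(kernel, n):
    """Largest singular value of the circular convolution: max |DFT(kernel)|."""
    return max(abs(c) for c in dft(pad(kernel, n)))


def normalized_conv(kernel, signal, eps=1e-12):
    """Circular convolution with every kernel frequency rescaled to magnitude 1.

    Frequencies where the kernel spectrum vanishes are zeroed out, in which
    case the operator is no longer an exact isometry.
    """
    K = dft(pad(kernel, len(signal)))
    S = dft(signal)
    unit = [a / abs(a) if abs(a) > eps else 0.0 for a in K]
    return [v.real for v in idft([u * s for u, s in zip(unit, S)])]


def l2(v):
    return math.sqrt(sum(x * x for x in v))


kernel = [1.0, 2.0, 0.5]
signal = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]
out = normalized_conv(kernel, signal)
```

With this kernel the unnormalized operator can amplify a signal by up to `spectral_norm(kernel, 8)`, whereas `out` has the same L2 norm as `signal`.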


Exact Bayesian Inference on Discrete Models via Probability Generating Functions: A Probabilistic Programming Approach

Neural Information Processing Systems

We present an exact Bayesian inference method for discrete statistical models, which can find exact solutions to a large class of discrete inference problems, even with infinite support and continuous priors. To express such models, we introduce a probabilistic programming language that supports discrete and continuous sampling, discrete observations, affine functions, (stochastic) branching, and conditioning on discrete events. Our key tool is probability generating functions: they provide a compact closed-form representation of distributions that are definable by programs, thus enabling the exact computation of posterior probabilities, expectation, variance, and higher moments. Our inference method is provably correct and fully automated in a tool called Genfer, which uses automatic differentiation (specifically, Taylor polynomials), but does not require computer algebra. Our experiments show that Genfer is often faster than the existing exact inference tools PSI, Dice, and Prodigy. On a range of real-world inference problems that none of these exact tools can solve, Genfer's performance is competitive with approximate Monte Carlo methods, while avoiding approximation errors.
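
As a small illustration of why probability generating functions (PGFs) are convenient, the sketch below represents a finite-support distribution by the coefficients of its PGF, G(s) = Σ_k P(X = k) s^k. It is not Genfer and does not cover infinite support or continuous priors. It only shows three textbook facts: the PGF of a sum of independent variables is the product of their PGFs, moments come from derivatives at s = 1, and conditioning on an event amounts to restricting and renormalizing the coefficients. All names are hypothetical.

```python
def poly_mul(a, b):
    """PGF of X + Y for independent X, Y: product of the two polynomials."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def derivative(coeffs):
    """Coefficients of G'(s) given coefficients of G(s)."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def evaluate(coeffs, s):
    return sum(c * s ** k for k, c in enumerate(coeffs))


def mean_and_variance(pgf):
    """E[X] = G'(1); Var[X] = G''(1) + G'(1) - G'(1)^2."""
    d1 = derivative(pgf)
    d2 = derivative(d1)
    mean = evaluate(d1, 1.0)
    return mean, evaluate(d2, 1.0) + mean - mean * mean


def condition_at_least_one(pgf):
    """Posterior PGF after observing the event X >= 1."""
    mass = sum(pgf[1:])
    return [0.0] + [c / mass for c in pgf[1:]]


# Three fair coin flips: Binomial(3, 0.5) as a product of Bernoulli PGFs.
bernoulli = [0.5, 0.5]
binomial3 = poly_mul(poly_mul(bernoulli, bernoulli), bernoulli)
prior_mean, prior_var = mean_and_variance(binomial3)
posterior = condition_at_least_one(binomial3)
posterior_mean, _ = mean_and_variance(posterior)
```

The coefficient list `binomial3` is the exact probability mass function, so no sampling error enters at any step.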


Scalable Optimization in the Modular Norm

Neural Information Processing Systems

To improve performance in contemporary deep learning, one is interested in scaling up the neural network in terms of both the number and the size of the layers. When ramping up the width of a single layer, graceful scaling of training has been linked to the need to normalize the weights and their updates in the natural norm particular to that layer. In this paper, we significantly generalize this idea by defining the modular norm, which is the natural norm on the full weight space of any neural network architecture. The modular norm is defined recursively in tandem with the network architecture itself. We show that the modular norm has several promising applications.


e92381dba235a8309f08ce46376189a9-Supplemental-Conference.pdf

Neural Information Processing Systems

We use the symmetrized cosine similarity loss from SimSiam. Model details: For CIFAR10, we use the pretrained StyleGAN available at the official website of StyleGAN-Ada [31]. We also experimented with the model with the best Inception score but did not observe a significant difference in results. Linear classification: The quality of the pretrained representations is evaluated by training a supervised linear classifier on frozen representations h in the training set, and then testing it in the validation set.
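
For reference, the symmetrized cosine similarity loss from SimSiam mentioned above can be sketched as follows. This is a minimal plain-Python sketch of the published SimSiam formula, L = -(cos(p1, z2) + cos(p2, z1)) / 2, where p denotes predictor outputs and z projector outputs for the two views. Whether this supplementary material uses exactly these inputs is an assumption, and the function names are hypothetical. In an autodiff framework, z1 and z2 would additionally be wrapped in a stop-gradient, which has no counterpart in this gradient-free sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simsiam_loss(p1, z1, p2, z2):
    """Symmetrized negative cosine similarity across the two views."""
    return -0.5 * (cosine(p1, z2) + cosine(p2, z1))


# Perfectly aligned views give the minimum value of -1.
aligned = simsiam_loss([1.0, 2.0], [2.0, 4.0], [2.0, 4.0], [1.0, 2.0])
# Orthogonal views give 0.
orthogonal = simsiam_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

The loss is bounded in [-1, 1] and is invariant to the scale of each vector, since only directions enter the cosine.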