Supplementary information for: Natural image synthesis for the retina with variational information bottleneck representation
To obtain a bound on the Information Bottleneck Gaussian Process (IB-GP) objective, we use the Markov chain constraint $Y \leftrightarrow X \leftrightarrow Z$ and the factorized joint distribution [2]:
$p(X,Y,Z) = p(Y|X,Z)\,p(Z|X)\,p(X) = p(Y|X)\,p(Z|X)\,p(X)$ (1)
to expand the mutual information terms in $L_{IB} = \max\; I(Z,Y) - \beta I(Z,X)$. Henceforth, we use the stochastic encoder $p_\phi(Z|X)$, parameterized by $\phi$, as an approximation for $p(Z|X)$. In practice, computing $H(Z)$ may be intractable even though $p(Z)$ is well defined. Therefore, a variational approximation $\phi(Z)$ is used in place of $p(Z)$, chosen such that $KL(p(Z), \phi(Z))$ is minimal. Similarly, computing $p(Y,Z)$ and $p(Y|Z)$ may be intractable even though they are well defined.
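The role of the variational approximation to p(Z) can be checked numerically on a toy discrete problem. The sketch below is illustrative only: the distributions, the name `q_z` for the variational marginal, and the helper functions are assumptions, not part of the paper. It shows that replacing p(Z) with any other marginal yields an upper bound on I(Z,X), and that the slack is exactly the KL divergence between p(Z) and the approximation, which is why that KL should be made as small as possible.

```python
import math

# Toy setup (illustrative values, not from the paper):
# two inputs x, three codes z, and a stochastic encoder p(z|x).
p_x = [0.4, 0.6]
p_z_given_x = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]


def marginal_z(p_x, p_z_given_x):
    """Exact marginal p(z) = sum_x p(x) p(z|x)."""
    n_z = len(p_z_given_x[0])
    return [
        sum(px * row[k] for px, row in zip(p_x, p_z_given_x))
        for k in range(n_z)
    ]


def kl(p, q):
    """KL(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected_kl_to(q_z, p_x, p_z_given_x):
    """E_x[ KL(p(z|x) || q(z)) ]; equals I(Z,X) when q(z) = p(z)."""
    return sum(px * kl(row, q_z) for px, row in zip(p_x, p_z_given_x))


p_z = marginal_z(p_x, p_z_given_x)

# Exact mutual information, using the true marginal p(z).
mutual_info = expected_kl_to(p_z, p_x, p_z_given_x)

# Hypothetical variational marginal: a uniform distribution over codes.
q_z = [1.0 / 3.0] * 3
upper_bound = expected_kl_to(q_z, p_x, p_z_given_x)

# The gap between the bound and the true value is KL(p(z) || q(z)).
gap = kl(p_z, q_z)
```

Here `upper_bound` equals `mutual_info + gap` up to floating-point error, so the bound tightens as the approximation approaches p(Z).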
We adopt a residual network (ResNet) [23] based feature extractor, with ELU as the activation function. Following [15], we adopt group normalization and instance normalization for better stability of the networks. We adopt the "leave-one-out" training strategy for obtaining the results on each of the categories of MVTec-AD. All experiments are performed with the same settings and hyperparameters. We resize all images to 128 × 128, and do not perform any data augmentation.
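The text does not spell out the leave-one-out protocol. One plausible reading, sketched below as an assumption, is that each MVTec-AD category is held out in turn for evaluation while the remaining categories are used for training. The function name `leave_one_out_splits` is hypothetical; the category names are the 15 standard MVTec-AD categories.

```python
# The 15 object and texture categories of MVTec-AD.
MVTEC_AD_CATEGORIES = [
    "bottle", "cable", "capsule", "carpet", "grid",
    "hazelnut", "leather", "metal_nut", "pill", "screw",
    "tile", "toothbrush", "transistor", "wood", "zipper",
]

# Input size stated in the text; no data augmentation is applied.
IMAGE_SIZE = (128, 128)


def leave_one_out_splits(categories):
    """Yield (training_categories, held_out_category) pairs.

    Assumed protocol: train on every category except one, then
    report results on the held-out category.
    """
    for held_out in categories:
        training = [c for c in categories if c != held_out]
        yield training, held_out


splits = list(leave_one_out_splits(MVTEC_AD_CATEGORIES))
```

Each of the 15 splits trains on 14 categories, so every category is evaluated exactly once under identical settings and hyperparameters.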
setup
A.1 Datasets
We use two standardized few-shot image classification datasets.
Mini-ImageNet: This dataset [58] is a subset of ImageNet [10] and consists of 64 classes for training, 16 for validation, and 20 for testing. There are 600 images per class, with images of size 84 × 84. Multiple versions of this dataset exist in the literature; we use the version by Ravi and Larochelle [43].
Tiered-ImageNet: A larger subset of ImageNet, Tiered-ImageNet [45] consists of 608 classes split into 351, 97, and 160 for training, validation, and testing, respectively.
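The class splits above can be written down as a small configuration, together with a rough sketch of how a few-shot episode might be drawn from one split. The episode sampler and its parameters (5-way, 1-shot, 15 queries) are illustrative assumptions, not settings stated in this section; `ClassSplit` and `sample_episode` are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassSplit:
    """Number of classes in each partition of a few-shot dataset."""
    train: int
    val: int
    test: int

    @property
    def total(self):
        return self.train + self.val + self.test


MINI_IMAGENET = ClassSplit(train=64, val=16, test=20)
TIERED_IMAGENET = ClassSplit(train=351, val=97, test=160)

# Stated for Mini-ImageNet: 600 images per class.
MINI_IMAGES_PER_CLASS = 600


def sample_episode(num_classes, images_per_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode as {class_id: (support_ids, query_ids)}.

    Classes and images are sampled without replacement, so support and
    query sets of a class never overlap.
    """
    episode = {}
    for class_id in rng.sample(range(num_classes), n_way):
        ids = rng.sample(range(images_per_class), k_shot + n_query)
        episode[class_id] = (ids[:k_shot], ids[k_shot:])
    return episode


# Hypothetical 5-way 1-shot episode from the Mini-ImageNet test classes.
rng = random.Random(0)
episode = sample_episode(
    MINI_IMAGENET.test, MINI_IMAGES_PER_CLASS,
    n_way=5, k_shot=1, n_query=15, rng=rng,
)
```

The totals follow directly from the splits: Mini-ImageNet has 100 classes and therefore 60,000 images, and the Tiered-ImageNet partitions sum to the stated 608 classes.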
Revisiting the Integration of Convolution and Attention for Vision Backbone
Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per-pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel at different granularity levels instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots.
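The scalability argument can be made concrete with a back-of-the-envelope count of token pairs that self-attention must score. The sketch below is an illustration under assumptions: the grid sizes and the slot count of 64 are hypothetical, and it only counts pairwise interactions inside self-attention, not how the grid and the slots exchange information.

```python
def attention_pair_counts(height, width, num_slots):
    """Compare self-attention cost on a pixel grid versus a slot set.

    Self-attention scores every token against every other token, so the
    number of pairs grows quadratically with the number of tokens.
    """
    grid_tokens = height * width
    return {
        "grid_tokens": grid_tokens,
        "pixel_attention_pairs": grid_tokens * grid_tokens,
        "slot_attention_pairs": num_slots * num_slots,
    }


# Hypothetical settings: a 56 x 56 feature grid, then the same grid at
# twice the resolution, each paired with a fixed set of 64 slots.
NUM_SLOTS = 64
low_res = attention_pair_counts(56, 56, NUM_SLOTS)
high_res = attention_pair_counts(112, 112, NUM_SLOTS)

# Doubling the side length multiplies pixel-level pairs by 16, while the
# slot-level count is unchanged because the slot set has a fixed size.
growth = high_res["pixel_attention_pairs"] // low_res["pixel_attention_pairs"]
```

Under these assumptions, attention over a fixed-size slot set has a cost independent of input resolution, whereas pixel-level attention grows with the fourth power of the side length.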