ablation study
Inducing Spatial Locality in Vision Transformers through the Training Protocol
Toledo, Eduardo Santiago, Martínez, Asael Fabian
We investigate whether the training protocol can induce spatial locality in the early layers of a Vision Transformer (ViT) trained from scratch, without large-scale pretraining. Keeping the architecture and optimization procedure fixed, we compare a Baseline protocol with a Modern protocol (AutoAugment/ColorJitter, CutMix, and Label Smoothing) on CIFAR-10, CIFAR-100, and Tiny-ImageNet, characterizing each attention head via Mean Attention Distance (MAD) and normalized entropy. Across all three datasets, the Modern protocol produces more local and more concentrated attention in early layers; on CIFAR-100, the minimum MAD drops from 0.316 (Baseline) to 0.008 (Modern). To identify the source of this effect, we conduct an ablation study on CIFAR-100 by adding or removing each component individually. The results identify CutMix as the determining component within our experiments: all conditions with CutMix exhibit MAD ≤ 0.024, while all conditions without CutMix remain at MAD ≥ 0.210. AutoAugment and Label Smoothing show no independent effect on locality. Taken together, these findings suggest that the pressure to classify from partial image regions, induced by CutMix, can promote the emergence of local attention in Vision Transformers.
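The two per-head statistics named in the abstract can be illustrated with a minimal sketch. The abstract does not define them, so everything here is an assumption: attention is taken over patch tokens only (no class token) on a square grid, distances are Euclidean between patch grid positions and normalized by the grid diagonal, and entropy is normalized by log N. The function names `mean_attention_distance` and `normalized_entropy` are hypothetical.

```python
import math


def patch_coords(grid):
    """Row-major (row, col) grid positions for grid*grid patches."""
    return [(r, c) for r in range(grid) for c in range(grid)]


def mean_attention_distance(attn, grid):
    """Attention-weighted query-key distance, averaged over queries.

    attn: N rows of N weights (each row sums to 1), N = grid * grid.
    Assumption: distances are divided by the grid diagonal so the
    result lies in [0, 1]; requires grid >= 2.
    """
    coords = patch_coords(grid)
    diag = math.hypot(grid - 1, grid - 1)
    total = 0.0
    for (qr, qc), row in zip(coords, attn):
        total += sum(
            w * math.hypot(qr - kr, qc - kc)
            for w, (kr, kc) in zip(row, coords)
        )
    return total / (len(attn) * diag)


def normalized_entropy(attn):
    """Mean Shannon entropy of the attention rows, divided by log N."""
    n = len(attn[0])
    total = 0.0
    for row in attn:
        total += -sum(w * math.log(w) for w in row if w > 0.0)
    return total / (len(attn) * math.log(n))
```

Under these assumptions, a head that attends only to its own patch scores 0 on both measures, while a head with uniform attention has normalized entropy 1 and a larger MAD.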
Supplementary Materials of Learning-to-Rank Meets Language: Boosting Language-Driven Ordering Alignment for Ordinal Classification
This supplementary material begins with a comprehensive visualization of the datasets central to our study. The specifics of our experimental settings are subsequently outlined in Section 1.2. Section 1.1 features an expanded analysis, including results from ablation studies. A key highlight of this section is the visual interpretation of the CLIP image features using t-SNE [6]. We also compare the efficacy of interpolation-based strategies with that of our learning-based methods (i.e.
Re-Think and Re-Design Graph Neural Networks in Spaces of Continuous Graph Diffusion Functionals
S1.1 Step-by-step derivation of min-max optimization in Section 2.2.1. By substituting Eq. 2 into Eq. 1 in the main manuscript, we can obtain the objective function of $z_i$ (we temporarily drop the subscript $i$ for clarity): $J(z) = \max \ldots$ Since $z$ might lie in a high-dimensional space, solving such a large system of linear equations under the constraint $|z| \le 1$ is often computationally challenging. In order to find a practical solution for $z$ that satisfies the constrained minimization problem in Eq. By setting $z_l$ as the point of coincidence, we can find a separable majorizer of $M(z)$ by adding the non-negative function $(z - z_l)^\top (\beta I - G_x G_x^\top)(z - z_l)$ (S6). Note, to unify the format, we use the matrix transpose property in Eq. Then, the next step is to find $z \in \mathbb{R}^N$ that minimizes $z^\top z - 2 b^\top z$ subject to the constraint $|z| \le 1$. Let us first consider the simplest case where $z$ is a scalar: $\arg\min \ldots$ If $|b| \le 1$, then the solution is $z = b$.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
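The scalar case at the end of this derivation can be checked numerically. The sketch below assumes the objective is z^2 - 2bz and the constraint is |z| <= 1; the excerpt only states the case where b itself is feasible, and the clipped solution for the remaining case follows from the convexity of the quadratic rather than from the source. The names `solve_scalar` and `objective` are hypothetical.

```python
def objective(z, b):
    """Scalar objective z^2 - 2*b*z (assumed form of the sub-problem)."""
    return z * z - 2.0 * b * z


def solve_scalar(b):
    """Minimize z^2 - 2*b*z subject to |z| <= 1.

    The unconstrained minimizer is z = b. When b lies outside [-1, 1],
    the convex objective is monotone on the interval, so the minimizer
    is the nearest endpoint, i.e. b clipped to [-1, 1].
    """
    return max(-1.0, min(1.0, b))


def brute_force(b, steps=2001):
    """Grid search over [-1, 1], used only to sanity-check solve_scalar."""
    grid = [-1.0 + 2.0 * k / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda z: objective(z, b))
```

Comparing `solve_scalar(b)` with `brute_force(b)` for a few values of b on either side of the interval is a quick way to confirm the clipping rule.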
On the Powerfulness of Textual Outlier Exposure for Visual OoD Detection (Appendix) A Additional experimental results
This section presents more comprehensive experimental results. A.1 Comparison with post-hoc methods. We also compare the performance of our textual outlier method with post-hoc approaches, another prominent family of methods in OoD detection. We compare against six widely used and recently proposed methods known for their detection performance (MSP [4], ODIN [8], Mahalanobis [7], Energy [10], ReAct [14], KNN [15]). All advanced baseline methods follow the settings of their original papers. Among these methods, our textual outlier approach demonstrates the best performance, as shown in Table 6, which further emphasizes its effectiveness.
Supplementary Material for DreamHuman: Animatable 3D Avatars from Text
This document contains additional details and experiments that did not fit in the main text due to space constraints. For animations and additional results, please also check the included videos. We use an optimization strategy similar to that of DreamFusion, so unless otherwise noted the hyperparameters remain the same. For example, we use the Distributed Shampoo optimizer [2]. As in DreamFusion, we also train on a TPUv4 machine with 4 chips.
Details and Ablation Studies for Language Modelling
A.1 Experimental Settings All language models in Table 1 have the same Transformer configuration: a 16-layer model with a hidden size of 128 with 8 heads, and a feed-forward dimension of 2048. We use a dropout [75, 76, 77] rate of 0.1. The batch size is 96 and we train for about 120 epochs with Adam optimiser [78] with an initial learning rate of 0.00025 and 2000 learning rate warm-up steps. All models are trained with a back-propagation span of 256 tokens. During training, these segments are treated independently, except for the + full context cases in Table 1 where the states (both recurrent states and fast weight states) from a segment are used as initialisation for the subsequent segment. The models in + full context cases are also evaluated in the same way by carrying over the context throughout the evaluation text with a batch size of one. For all other cases, the evaluation is done by going through the text with a sliding window of size 256 with a batch size of one. Transformer states are computed for all positions in each window, but only the last position is used to compute perplexity (except in the first segment where all positions are used for evaluation) [2].
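The sliding-window evaluation described above can be sketched as a plan of which token positions are scored in which window. This is a minimal illustration, not the authors' code: it assumes the window advances one token at a time after the first segment, which is what scoring only the last position implies, and the function name `evaluation_plan` is hypothetical.

```python
def evaluation_plan(n_tokens, window=256):
    """List (start, end, scored_positions) for each evaluation window.

    The first window covers tokens [0, window) and scores all of them.
    Every later window covers `window` tokens ending at one new token
    and scores only that last position, so each token is scored once.
    Assumption: a stride of one token between consecutive windows.
    """
    first_end = min(window, n_tokens)
    plan = [(0, first_end, list(range(first_end)))]
    for last in range(window, n_tokens):
        start = last - window + 1
        plan.append((start, last + 1, [last]))
    return plan
```

For a text of T tokens with T larger than the window size, this yields T - 255 forward passes at window size 256, each with a batch size of one, which is why this evaluation mode is much more expensive than carrying the state over between segments.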
$\ldots \tfrac{1}{2} a t^2), \quad v_0 = \sqrt{v_x^2 + v_y^2} + a t, \quad v_{0x} = v_0 \cos(\theta_0), \quad v_{0y} = v_0 \sin(\theta_0). \quad (2)$
Define an agent's current state information as $s = (x, y, \theta, v_x, v_y)$, which includes the $x, y$ positions in the coordinate space, the yaw angle $\theta$, and the velocities in the $x$ and $y$ directions. Inverse kinematics can be used to calculate the actions for behavior cloning. With the Bicycle action space, we propose a model that approximates the vehicle dynamics, with the goal of minimizing the discrepancy between the predicted and the recorded vehicle states. More specifically, let $(x, y)$ be the vehicle's coordinates in the global coordinate system and $(\hat{x}, \hat{y})$ the predicted coordinates; the goal is to minimize $(x - \hat{x})^2 + (y - \hat{y})^2$. Define the current vehicle's state information as $s$, which includes the coordinates of the vehicle in the global coordinate system $(x, y)$, the vehicle's yaw angle $\theta$, and the vehicle's speeds in the $x$ and $y$ directions, $v_x, v_y$.
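As a small illustration of the quantities defined here, the sketch below decomposes a speed into x and y components using the yaw angle, as in the cosine and sine relations of Eq. (2), and evaluates the squared position error that the dynamics model aims to minimize. The function names are hypothetical, and the dynamics model itself is not reproduced, since the excerpt does not specify it.

```python
import math


def velocity_components(speed, yaw):
    """Split a scalar speed into (v_x, v_y) along the yaw direction."""
    return speed * math.cos(yaw), speed * math.sin(yaw)


def position_error(x, y, x_hat, y_hat):
    """Squared distance between recorded (x, y) and predicted (x_hat, y_hat)."""
    return (x - x_hat) ** 2 + (y - y_hat) ** 2


def mean_position_error(recorded, predicted):
    """Average squared position error over paired (x, y) trajectories."""
    errors = [
        position_error(x, y, x_hat, y_hat)
        for (x, y), (x_hat, y_hat) in zip(recorded, predicted)
    ]
    return sum(errors) / len(errors)
```

Averaging the error over a trajectory, as in `mean_position_error`, is one plausible way to turn the per-step objective into a single training loss; the excerpt states only the per-step form.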