

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

arXiv.org Machine Learning

Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_n$ of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws as different $N_n$ and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.
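The interplay between gap counting and softmax concentration described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the abstract does not give the formal definition of $N_n$ or of the upper-tail accumulation scale, so `gap_count` below is a plain reading of "how many competitors lie within each gap from the maximum", and the i.i.d. Gaussian scores, the function names, and the choice of `n` are all hypothetical.

```python
import math
import random


def gap_count(scores, t):
    """Count competitors whose score lies within gap t of the row maximum."""
    top = max(scores)
    # Subtract 1 so the maximizer itself is not counted as a competitor.
    return sum(1 for s in scores if top - s <= t) - 1


def softmax_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores)."""
    top = max(scores)
    # Shift by the maximum for numerical stability; softmax is shift-invariant.
    weights = [math.exp(-beta * (top - s)) for s in scores]
    z = sum(weights)
    probs = [w / z for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_by_scaling(n=4096, seed=0):
    """Entropy of one attention row under the three scaling laws named above.

    Assumption: scores are i.i.d. standard Gaussian, one idealized score family.
    """
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
    log_n = math.log(n)
    betas = {
        "(log n)^(1/2)": math.sqrt(log_n),
        "log n": log_n,
        "(log n)^2": log_n ** 2,
    }
    return {name: softmax_entropy(scores, beta) for name, beta in betas.items()}


if __name__ == "__main__":
    for name, entropy in entropy_by_scaling().items():
        print(f"beta = {name}: entropy = {entropy:.4f} nats")
```

Since softmax entropy is non-increasing in the inverse temperature for a fixed row, the printed values decrease from the first scaling to the last; which scaling sits at the critical threshold for a given score family is what the paper's theory addresses, not this sketch.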


TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification: Appendix

Neural Information Processing Systems

For easy and reliable comparison, in the ablation study and parameter analysis we report the average of all Rank-1 and mAP results on all test datasets over several random runs. This is denoted by mAcc. There are three reasons for using mAcc. First, it is a unified measure, which is convenient for algorithm comparison. Second, both Rank-1 and mAP are accuracy measures ranging from 0% to 100%, so averaging them is possible. Third, if one method's mAcc is 1% higher than another's, it means that, on average, every single measure on each dataset has increased by 1%, which is a perceptible improvement.
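The mAcc computation described above can be sketched as a flat average. This is a minimal sketch, assuming every Rank-1 and mAP value is weighted equally across datasets and runs; the function name, the data layout, and the dataset names `A` and `B` are hypothetical, and the numbers are made up purely for illustration.

```python
from statistics import mean


def macc(runs):
    """Average all Rank-1 and mAP values over all datasets and all runs.

    `runs` is a list with one dict per random run, mapping a dataset name
    to a (rank1, mAP) pair given in percent.
    """
    values = [
        metric
        for run in runs
        for rank1_and_map in run.values()
        for metric in rank1_and_map
    ]
    return mean(values)


if __name__ == "__main__":
    # Hypothetical results for two datasets over two random runs.
    runs = [
        {"A": (60.0, 30.0), "B": (40.0, 20.0)},
        {"A": (62.0, 32.0), "B": (42.0, 22.0)},
    ]
    print(f"mAcc = {macc(runs):.2f}%")
```

Raising every individual measure by one point raises this average by exactly one point, which matches the interpretation given in the text.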


e6c2e85db1f1039177c4495ccd399ac4-Supplemental-Conference.pdf

Neural Information Processing Systems

A.1 Preliminary Study

The basic GPT-2 model, which has 12 transformer blocks and 12 attention heads with 768 hidden dimensions, is trained from scratch on each corpus. The Huggingface transformers [4] and PyTorch [2] toolkits are used to train the GPT-2 model in a distributed manner on an A100 GPU server. The hyper-parameters used during training are shown in Table 1.

Hyper-parameter            Value
Optimization steps         100K
Test interval              10K
Dropout rate               0.1
Grad clipping              1.0
Learning rate              5e-5
Batch size                 128
Maximum sequence length    256
Warmup steps               10K
Learning scheduler         Linear decay
Random seed                0
Number of GPUs             4
Learning objective         Cross-Entropy Loss

Table 1: The hyper-parameters of the GPT-2 training procedure.

Most of the hyper-parameters for our proposed method are the same as those in Table 1, for better variable control. The hyper-parameters specific to our proposed method are the length of the repetitive n-gram and its repetition dropout rate p, which are set to 2 and 0.6, respectively.
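The warmup and linear-decay entries of Table 1 can be made concrete with a small schedule function. This is a sketch under assumptions: the source states only "Linear decay" with 10K warmup steps and 100K optimization steps, so the exact shape used here (linear warmup from zero to the peak, then linear decay to zero) is an assumed, commonly used form, the peak learning rate is taken as 5e-5, and the function name is hypothetical.

```python
def linear_warmup_decay_lr(
    step,
    peak_lr=5e-5,          # assumed reading of the learning-rate entry in Table 1
    warmup_steps=10_000,   # "Warmup steps 10K"
    total_steps=100_000,   # "Optimization steps 100K"
):
    """Learning rate at a given optimization step.

    Assumed shape: linear warmup from 0 to peak_lr over warmup_steps,
    followed by linear decay from peak_lr to 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # Print the rate at every test interval (10K steps, per Table 1).
    for step in range(0, 100_001, 10_000):
        print(f"step {step:>6}: lr = {linear_warmup_decay_lr(step):.2e}")
```

Under this assumed shape, the rate peaks at the end of warmup and reaches zero at the final optimization step.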





on Fine tuning with a Dense Model

Neural Information Processing Systems

Our 8B MoE model achieves stronger pre-training perplexity than its dense counterpart. However, a better perplexity does not always directly translate to downstream performance, as demonstrated in Section 4.4. To this end, we compare the fine-tuning performance of the 8B dense model and the MoE model in Table 1. As shown in the table, our MoE model using expert choice routing consistently outperforms the dense model across the 11 tasks in GLUE and SuperGLUE. We evaluate the downstream task fine-tuning performance by varying the capacity factors.


259a5df46308d60f8454bd4adcc3b462-Supplemental-Conference.pdf

Neural Information Processing Systems

As mentioned in the paper, the visual-grounded alignment decoder is applied with cross-attention blocks to align multimodal information, while the visual-grounded generation decoder is adopted to generate natural language conditioned on the visual input; their architectures are shown in Figure 1. Based on these deeply fused representations, we finally generate the predicted answers with the visual-grounded generation decoder. In this section, we describe the settings used when fine-tuning the pretrained models on various downstream tasks. We use RandomAugment [1] for data augmentation. The default settings for fine-tuning on each dataset are shown in Table 1.