A Appendix
Memory Cost of Self-attention Weights in DETR: DETR has six encoder-decoder pairs. Figure 1 presents the structure of the encoder, the decoder, and the embedded Multi-Head Self-Attention (MHSA) layer. Each MHSA layer produces a self-attention weight tensor from the multiplication of Query and Key, as shown in Figure 1. The memory cost of this tensor during training, under different hyperparameter settings and optimization strategies, is plotted in Figure 2. The figure shows that more attention heads and, in particular, larger downsampling ratios significantly increase the memory cost. Additionally, the Adam and AdamW optimizers, commonly used to train vision transformers, consume more memory than plain SGD.
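For concreteness, the back-of-the-envelope sketch below estimates these two contributions: the self-attention weight tensor, whose size is quadratic in the number of spatial tokens (and hence governed by the downsampling ratio) and linear in the number of heads, and the extra per-parameter state kept by Adam/AdamW compared with SGD. The input resolution, batch size, stride values, parameter count, and function names are illustrative assumptions, not the exact settings used for Figure 2.

```python
def attention_weight_memory_mb(img_h=800, img_w=800, downsample=32,
                               num_heads=8, batch_size=2, bytes_per_elem=4):
    """Rough size (MB) of one MHSA layer's attention weight tensor.

    Assumed inputs: image resolution, backbone downsampling stride,
    number of heads, batch size, and fp32 storage (4 bytes/element).
    """
    # Number of spatial tokens the encoder attends over.
    n_tokens = (img_h // downsample) * (img_w // downsample)
    # softmax(Q K^T) has shape (batch, heads, n_tokens, n_tokens),
    # i.e. it grows quadratically with the token count.
    n_elems = batch_size * num_heads * n_tokens * n_tokens
    return n_elems * bytes_per_elem / 2**20


def optimizer_state_memory_mb(n_params, optimizer="adamw", bytes_per_elem=4):
    """Extra optimizer state on top of the weights and gradients."""
    # SGD without momentum keeps no extra state; Adam/AdamW keep two
    # moment buffers per parameter.
    extra_buffers = {"sgd": 0, "sgd_momentum": 1, "adam": 2, "adamw": 2}
    return extra_buffers[optimizer] * n_params * bytes_per_elem / 2**20


if __name__ == "__main__":
    for heads in (8, 16):
        for stride in (32, 16):
            mb = attention_weight_memory_mb(num_heads=heads, downsample=stride)
            print(f"heads={heads:2d}, stride={stride:2d}: {mb:8.1f} MB per MHSA layer")
    # DETR-R50 has roughly 41M parameters (assumed here for illustration).
    for opt in ("sgd", "adamw"):
        print(f"{opt}: +{optimizer_state_memory_mb(41e6, opt):.0f} MB of optimizer state")
```

Under these assumptions, halving the stride from 32 to 16 quadruples the token count and increases the attention tensor's footprint by roughly 16x per layer, while switching from SGD to AdamW adds about two parameter-sized buffers of optimizer state.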