Memory Complexity with Transformers - KDnuggets
The key innovation in Transformers is the self-attention mechanism, which computes similarity scores between every pair of positions in an input sequence. Because these scores can be evaluated in parallel for every token, Transformers avoid the sequential dependency of recurrent neural networks and vastly outperform earlier sequence models such as LSTMs. There are many deep explanations elsewhere, so here I'd like to share some example questions from an interview setting, along with tips for readers' reference. A limitation of existing Transformer models and their derivatives is that the full self-attention mechanism has computational and memory requirements that are quadratic in the input sequence length: if you try to run a large Transformer on a long sequence, you simply run out of memory. What could be a solution to this problem?
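To make the quadratic cost concrete, here is a minimal sketch (my own illustration, not code from any particular library) of naive single-head attention in NumPy. The score matrix has shape (n, n), so its memory grows with the square of the sequence length n:

```python
import numpy as np

def naive_attention(q, k, v):
    # q, k, v: (n, d) arrays for a single attention head.
    # scores is (n, n) -- this is the quadratic-memory term.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Doubling the sequence length quadruples the score-matrix memory.
for n in (1024, 2048, 4096):
    score_bytes = n * n * 4  # float32 entries
    print(f"n={n}: score matrix ~{score_bytes / 2**20:.0f} MiB")
```

At n = 4096 the float32 score matrix alone takes 64 MiB per head per layer, before any activations or gradients, which is why long sequences exhaust memory so quickly.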
Dec-9-2022, 16:53:01 GMT