self-attention layer
LimitstoDepth-EfficienciesofSelf-Attention
Self-attention architectures, which are rapidly pushing the frontier innatural language processing, demonstrate asurprising depth-inefficient behavior: previous works indicate that increasing the internal representation (network width) isjust as useful as increasing the number of self-attention layers (network depth).
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Switzerland > Basel-City > Basel (0.06)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
Transformer architectures have shown impressive performance in multiple research domains and have become the backbone of many neural network models. However, there is limited understanding on how Transformer works. In particular, with a simple predictive loss, how the representation emerges from the gradient training dynamics remains a mystery. In this paper, we analyze the SGD training dynamics for 1-layer transformer with one self-attention plus one decoder layer, for the task of next token prediction in a mathematically rigorous manner. We open the black box of the dynamic process of how the self-attention layer combines input tokens, and reveal the nature of underlying inductive bias. More specifically, with the assumption (a) no positional encoding, (b) long input sequence, and (c) the decoder layer learns faster than the self-attention layer, we prove that self-attention acts as a discriminative scanning algorithm: starting from uniform attention, it gradually attends more to key tokens that are distinct for a specific next token to be predicted, and pays less attention to common key tokens that occur across different next tokens. Among distinct tokens, it progressively drops attention weights, following the order of low to high co-occurrence between the key and the query token in the training set. Interestingly, this procedure does not lead to winner-takes-all, but decelerates due to a phase transition that is controllable by the learning rates of the two layers, leaving (almost) fixed token combination. We verify this scan and snap dynamics on synthetic and real-world data (WikiText).
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- North America > United States > Texas (0.04)
- North America > United States > Michigan (0.04)
- North America > Dominican Republic (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Increasing GPU Utilization during Generative Inference for Higher Throughput
Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even larger than the model itself. This problem is exacerbated in one of the current LLM serving frameworks which reserves the maximum sequence length of memory for the KV cache to guarantee generating a complete sequence as they do not know the output sequence length. This restricts us to use a smaller batch size leading to lower GPU utilization and above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence can mitigate this problem.
- North America > United States > California > San Diego County > Carlsbad (0.04)
- Asia > Taiwan (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Asia > China (0.04)
6 Appendix
We observe that for the self-attention layers, the correlation of weights for the same head is stronger. Additionally, the best grouping might depend on the type of the layer (e.g., key, query, value, or To simplify the implementation, we treat all the different kernels in the self-attention as a type of fully-connected layer. We down-sample along each dimension to make the computation feasible. To relate with the Frobenius norm, we compute the square of each element and normalize the value. In Figure 5, we show the approximation error comparison for different approximation methods.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Workflow (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)