A Performer architecture details
Neural Information Processing Systems
We define the Performer architecture formally as follows. [...] V $\in \mathbb{R}^{d_{model} \times d}$ are trainable parameters (separate for each instance of MultiHead-Att, FFN), "+" is broadcast rowwise when biases are added, and LN is layer normalization [2], which is applied rowwise and depends on additional trainable parameters. GeLU denotes the Gaussian Error Linear Unit [16], which is applied elementwise.

[...] Similarly, U(n) does not affect L(1), ..., L(n), so [...]

[...] This way, the 3D tensor $R \in \mathbb{R}^{L \times d \times M}$ is not stored in memory explicitly, resulting in $O(L)$ time and $O(L(d + M) + dM)$ memory complexity. In order to have the same memory consumption during back-propagation, [18] propose the following routine. [...]
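To illustrate why the 3D tensor never has to be materialized in the forward pass, here is a minimal pure-Python sketch of causal attention computed with a running prefix sum. It is not the back-propagation routine of [18], and it is not taken from the paper. The function names, the argument names (`q_feat`, `k_feat`, `values`), and the assumption that queries and keys have already been mapped to non-negative M-dimensional features are all hypothetical choices made for this example.

```python
def causal_linear_attention(q_feat, k_feat, values):
    """Causal attention via a running prefix sum (illustrative sketch).

    q_feat, k_feat: L rows of M non-negative floats (assumed to be
        already feature-mapped queries and keys; hypothetical inputs).
    values: L rows of d floats.
    Only one d x M slice of the prefix-sum tensor is alive at a time.
    """
    M, d = len(k_feat[0]), len(values[0])
    state = [[0.0] * M for _ in range(d)]  # running sum of v_l k_l^T (d x M)
    norm = [0.0] * M                       # running sum of k_l (length M)
    out = []
    for q, k, v in zip(q_feat, k_feat, values):
        # Fold the current position into the running sums.
        for i in range(d):
            for j in range(M):
                state[i][j] += v[i] * k[j]
        for j in range(M):
            norm[j] += k[j]
        # Read out the normalized attention result for this position.
        denom = sum(q[j] * norm[j] for j in range(M))
        out.append([sum(state[i][j] * q[j] for j in range(M)) / denom
                    for i in range(d)])
    return out


def naive_causal_attention(q_feat, k_feat, values):
    """Quadratic-time reference that forms the masked weights explicitly."""
    d = len(values[0])
    out = []
    for pos, q in enumerate(q_feat):
        weights = [sum(a * b for a, b in zip(q, k_feat[p]))
                   for p in range(pos + 1)]
        total = sum(weights)
        out.append([sum(w * values[p][i] for p, w in enumerate(weights)) / total
                    for i in range(d)])
    return out


# Tiny made-up example: L = 3 positions, M = 2 features, d = 3 value dims.
q_feat = [[1.0, 2.0], [0.5, 1.0], [2.0, 0.5]]
k_feat = [[1.0, 0.5], [0.25, 1.0], [1.5, 2.0]]
values = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]
fast = causal_linear_attention(q_feat, k_feat, values)
slow = naive_causal_attention(q_feat, k_feat, values)
```

In this sketch the only extra storage besides the inputs and outputs is the d x M running state and the length-M normalizer, which is in line with the $O(L(d + M) + dM)$ memory bound quoted above. The two functions should agree up to floating-point rounding.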
Apr-25-2026, 11:00:33 GMT