A Performer architecture details

Apr-25-2026, 11:00:33 GMT–Neural Information Processing Systems

We define the Performer architecture formally as follows. $V \in \mathbb{R}^{d_{model} \times d}$ are trainable parameters (separate for each instance of MultiHead-Att, FFN), "+" is broadcast rowwise when biases are added, and LN is layer normalization [2], which is applied rowwise and depends on additional trainable parameters. GeLU denotes the Gaussian Error Linear Unit [16], which is applied elementwise. Similarly, $U^{(n)}$ does not affect $L^{(1)}, \dots, L^{(n)}$, so ... This way, the 3D tensor $R \in \mathbb{R}^{L \times d \times M}$ is not stored in memory explicitly, resulting in $O(L)$ time and $O(L(d + M) + dM)$ memory complexity. In order to have the same memory consumption during back-propagation, [18] propose the following routine.
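To illustrate how the 3D tensor can be left unmaterialized, here is a minimal sketch of causal attention computed with a running prefix sum. It is not taken from the paper. It assumes the tensor holds the prefix sums of outer products between key features (length M) and value rows (length d), and that the feature maps are strictly positive so the normalizer is nonzero. The function names `causal_linear_attention` and `causal_attention_reference` are hypothetical, and pure Python lists are used only to keep the example self-contained.

```python
def causal_linear_attention(q_feat, k_feat, values):
    """Causal attention via a running prefix sum (illustrative sketch).

    q_feat, k_feat: L rows of M positive floats (assumed feature maps).
    values:         L rows of d floats.
    Only an M x d running state is kept, never the full L x d x M tensor.
    """
    M, d = len(k_feat[0]), len(values[0])
    state = [[0.0] * d for _ in range(M)]  # running sum of outer(k, v)
    norm = [0.0] * M                       # running sum of k
    out = []
    for q, k, v in zip(q_feat, k_feat, values):
        # Fold the current position into the running sums.
        for m in range(M):
            norm[m] += k[m]
            for j in range(d):
                state[m][j] += k[m] * v[j]
        # Read out the normalized attention result for this position.
        denom = sum(q[m] * norm[m] for m in range(M))
        out.append([sum(q[m] * state[m][j] for m in range(M)) / denom
                    for j in range(d)])
    return out


def causal_attention_reference(q_feat, k_feat, values):
    """Quadratic-time reference: explicit causal attention weights."""
    d = len(values[0])
    out = []
    for i, q in enumerate(q_feat):
        weights = [sum(a * b for a, b in zip(q, k_feat[j]))
                   for j in range(i + 1)]
        total = sum(weights)
        out.append([sum(w * values[j][c] for j, w in enumerate(weights)) / total
                    for c in range(d)])
    return out


if __name__ == "__main__":
    q = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0]]
    k = [[1.0, 0.5], [2.0, 1.0], [0.5, 3.0]]
    v = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]
    print(causal_linear_attention(q, k, v))
    print(causal_attention_reference(q, k, v))
```

Under these assumptions the extra state is M x d regardless of sequence length, which is consistent with the dM term in the stated memory bound. The inputs and outputs account for the L(d + M) term.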



