Spark Transformer: Reactivating Sparsity in Transformer FFN and Attention

Neural Information Processing Systems 

The discovery of the lazy neuron phenomenon [54], where fewer than 10% of the feed-forward network (FFN) parameters in trained Transformers are activated per token, has spurred significant interest in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits across CPUs, GPUs, and TPUs, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts to reintroduce activation sparsity, e.g., by reverting to ReLU, applying top-k masking, or adding a sparse predictor, often degrade model quality, increase parameter count, or complicate training.
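
As a rough illustration of the two mechanisms named above, the following sketch contrasts ReLU with top-k masking on a single vector of FFN pre-activations. It is a toy example under stated assumptions, not the method proposed in this paper: the helper names (`relu`, `top_k_mask`, `active_fraction`), the vector length of 1000, and the choice k = 80 are all hypothetical, and the random Gaussian inputs do not reproduce the lazy neuron phenomenon, which is observed in trained models.

```python
import random

def relu(x):
    """Elementwise ReLU: negative entries become zero."""
    return [max(0.0, v) for v in x]

def top_k_mask(x, k):
    """Keep the k largest entries of x and zero out the rest."""
    if k <= 0:
        return [0.0] * len(x)
    # Indices of the k largest values (ties go to the earlier index).
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

def active_fraction(x):
    """Fraction of entries that are nonzero, i.e. 'activated'."""
    return sum(1 for v in x if v != 0.0) / len(x)

# Hypothetical pre-activations for one token (assumption: unit Gaussian).
random.seed(0)
pre_activations = [random.gauss(0.0, 1.0) for _ in range(1000)]

relu_out = relu(pre_activations)
topk_out = top_k_mask(pre_activations, k=80)

print("ReLU active fraction: ", active_fraction(relu_out))
print("top-k active fraction:", active_fraction(topk_out))
```

In this sketch the sparsity of the ReLU output depends entirely on the input distribution, whereas top-k masking fixes the number of active entries at k regardless of the input, which is what makes it attractive for enforcing a sparsity budget.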