Spark Transformer: Reactivating Sparsity in Transformer FFN and Attention
Neural Information Processing Systems
The discovery of the *lazy neuron phenomenon* (Li et al., 2022), where fewer than 10% of the feedforward network (FFN) parameters in trained Transformers are activated per token, has spurred significant interest in *activation sparsity* for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits across CPUs, GPUs, and TPUs, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts to reintroduce activation sparsity, e.g., by reverting to ReLU or applying top-k masking, often degrade model quality, increase parameter count, or complicate training.
Jun-11-2026, 04:13:09 GMT