Spark Transformer: Reactivating Sparsity in Transformer FFN and Attention

Jun-11-2026, 04:13:09 GMT–Neural Information Processing Systems

The discovery of the *lazy neuron phenomenon* (Li et al., 2022), where fewer than 10% of the feedforward network (FFN) parameters in trained Transformers are activated per token, has spurred significant interest in *activation sparsity* for enhancing large model efficiency. While notable progress has been made in translating such sparsity into wall-time benefits across CPUs, GPUs, and TPUs, modern Transformers have moved away from the ReLU activation function that is crucial to this phenomenon. Existing efforts to reintroduce activation sparsity, e.g., by reverting to ReLU or applying top-k masking, often degrade model quality, increase parameter count, or complicate training.
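To make the two baseline mechanisms named above concrete, here is a minimal sketch of ReLU and top-k masking applied to a vector of FFN pre-activations, together with a helper that measures the fraction of active neurons. It is an illustration only, not the method proposed in the paper: the function names, the hidden width of 1000, the choice of k = 80, and the Gaussian pre-activations are all assumptions made for this example.

```python
import random

def relu(xs):
    """Elementwise ReLU: negative pre-activations become exactly zero."""
    return [x if x > 0.0 else 0.0 for x in xs]

def top_k_mask(xs, k):
    """Keep the k largest entries of xs and zero out all others."""
    if k <= 0:
        return [0.0] * len(xs)
    # Indices of the k largest values; ties are broken by the lower index.
    order = sorted(range(len(xs)), key=lambda i: (-xs[i], i))
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(xs)]

def active_fraction(xs):
    """Fraction of entries that are nonzero, i.e. 'activated' neurons."""
    return sum(1 for x in xs if x != 0.0) / len(xs)

# Hypothetical pre-activations for one token in an FFN of width 1000.
# Random Gaussian values stand in for a trained model's hidden layer.
rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(1000)]

relu_out = relu(pre_activations)
topk_out = top_k_mask(pre_activations, k=80)  # keep 8% of neurons

print("ReLU active fraction: ", active_fraction(relu_out))
print("top-k active fraction:", active_fraction(topk_out))
```

With random inputs, ReLU leaves roughly half of the neurons active, so this sketch does not reproduce the lazy neuron phenomenon itself; the sub-10% figure is an empirical property of trained models. Top-k masking, by contrast, fixes the number of active neurons by construction, which is why it is a natural way to enforce a target sparsity level.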

artificial intelligence, machine learning, sparsity

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.58)