S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

Neural Information Processing Systems 

Training deep neural networks (DNNs) is costly. Fortunately, the sparse tensor cores on NVIDIA Ampere and Hopper GPUs can execute matrix multiplications with 2:4 sparsity up to twice as fast as their dense equivalents.
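The 2:4 sparsity pattern requires that in every contiguous group of four weights, at most two are nonzero. A minimal sketch of magnitude-based 2:4 pruning in NumPy (the function name `prune_2_4` is illustrative, not from the paper):

```python
import numpy as np

def prune_2_4(w):
    """Magnitude-based 2:4 pruning: in every contiguous group of 4
    weights, keep the 2 largest-magnitude entries and zero the rest.
    Assumes the number of elements in w is a multiple of 4."""
    groups = w.reshape(-1, 4)
    # indices of the 2 smallest-magnitude entries in each group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.array([[0.9, -0.1, 0.05, -0.8],
              [0.2, 0.3, -0.25, 0.1]])
print(prune_2_4(w))
# Each row keeps only its 2 largest-magnitude weights:
# [[ 0.9   0.    0.   -0.8 ]
#  [ 0.    0.3  -0.25  0.  ]]
```

The hardware then stores only the surviving values plus a small index per group, which is what enables the speedup on supported GPUs.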