Navigating Extremes: Dynamic Sparsity in Large Output Spaces

Neural Information Processing Systems 

In recent years, Dynamic Sparse Training (DST) has emerged as an alternative to post-training pruning for producing efficient models. In principle, DST allows for a more memory-efficient training process, as it maintains sparsity throughout the entire training run. In practice, however, current DST implementations fail to capitalize on this: because sparse matrix multiplication is much less efficient than dense matrix multiplication on GPUs, most implementations simulate sparsity by masking weights.
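
To illustrate the masking approach the abstract refers to, below is a minimal sketch of a masked linear layer in PyTorch. This is not the paper's implementation, and all names (MaskedLinear, sparsity) are illustrative: the point is that the weight tensor stays dense and a binary mask zeroes out the "pruned" entries on the fly, so the GPU still executes a full dense matrix multiplication and neither compute nor memory is actually saved.

import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Dense layer with simulated sparsity via a binary weight mask."""

    def __init__(self, in_features: int, out_features: int, sparsity: float = 0.9):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # Binary mask with roughly (1 - sparsity) of entries active.
        # DST methods would periodically prune and regrow entries of this mask;
        # here it is fixed for simplicity.
        mask = (torch.rand(out_features, in_features) > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The mask is applied on the fly: the multiplication itself is dense,
        # so the nominal sparsity yields no speed or memory benefit.
        return x @ (self.weight * self.mask).t()

layer = MaskedLinear(1024, 512, sparsity=0.9)
out = layer(torch.randn(8, 1024))
print(out.shape)  # torch.Size([8, 512])

Note that self.weight is stored fully dense and gradients are computed for every entry, which is exactly why masked implementations do not realize the memory savings that DST promises in principle.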
