Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior
