Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior