A Mathematical Theory of Top-$k$ Sparse Attention via Total Variation Distance
