

Sampled Softmax with Random Fourier Features

Ankit Singh Rawat, Jiecao Chen, Felix Xinnan X. Yu, Ananda Theertha Suresh, Sanjiv Kumar

Neural Information Processing Systems

The computational cost of training with the softmax cross-entropy loss grows linearly with the number of classes. In settings involving a large number of classes, a common way to speed up training is to sample a subset of classes and use an estimate of the loss gradient based on these classes, known as the sampled softmax method. However, sampled softmax provides a biased estimate of the gradient unless the samples are drawn from the exact softmax distribution, which is itself expensive to compute. A widely employed practical approach therefore samples from a simpler distribution in the hope of approximating the exact softmax distribution. In this paper, we develop the first theoretical understanding of the role that different sampling distributions play in determining the quality of sampled softmax. Motivated by our analysis and prior work on kernel-based sampling, we propose the Random Fourier Softmax (RF-softmax) method, which uses Random Fourier Features to enable more efficient and accurate sampling from an approximate softmax distribution. We show that RF-softmax yields low estimation bias with respect to both the full softmax distribution and the full softmax gradient.
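The core step of RF-softmax, sampling negatives from a random-Fourier-feature approximation of the softmax distribution, can be sketched in a few lines. Below is a minimal NumPy sketch, assuming unit-norm embeddings (so that $\exp(q\cdot c)$ is proportional to the Gaussian kernel $\exp(-\|q-c\|^2/2)$) and using the hypothetical names rff_map and rf_softmax_sample; this flat proposal over all classes illustrates the idea, not the paper's efficient kernel-based sampler:

    import numpy as np

    def rff_map(X, W, b):
        # Random Fourier feature map (Rahimi & Recht): with rows of W drawn
        # from N(0, I) and b ~ Uniform(0, 2*pi), phi(x) . phi(y) approximates
        # the Gaussian kernel exp(-||x - y||^2 / 2).
        D = W.shape[0]
        return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

    def rf_softmax_sample(query, class_embs, W, b, num_samples, rng):
        # Proposal q_i proportional to an RFF estimate of exp(query . c_i);
        # for unit-norm vectors, exp(q . c) = e * exp(-||q - c||^2 / 2).
        scores = rff_map(class_embs, W, b) @ rff_map(query, W, b)
        scores = np.clip(scores, 1e-12, None)  # guard rare negative estimates
        q = scores / scores.sum()
        return rng.choice(len(class_embs), size=num_samples, p=q), q

In sampled softmax training, the logits of the sampled classes are then corrected by subtracting $\log q_i$, so a proposal closer to the exact softmax distribution, which is what the RFF approximation buys, translates directly into lower gradient bias.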


A Mathematical Theory of Top-$k$ Sparse Attention via Total Variation Distance

Georgios Tzachristas, Lei Deng, Ioannis Tzachristas, Gong Zhang, Renhai Chen

arXiv.org Artificial Intelligence

We develop a unified mathematical framework for certified Top-$k$ attention truncation that quantifies approximation error at both the distribution and output levels. For a single attention distribution $P$ and its Top-$k$ truncation $\hat P$, we show that the total-variation distance coincides with the discarded softmax tail mass and satisfies $\mathrm{TV}(P,\hat P)=1-e^{-\mathrm{KL}(\hat P\Vert P)}$, yielding sharp Top-$k$-specific bounds in place of generic inequalities. From this we derive non-asymptotic deterministic bounds (from a single boundary gap through multi-gap and blockwise variants) that control $\mathrm{TV}(P,\hat P)$ using only the ordered logits. Using an exact head-tail decomposition, we prove that the output error factorizes as $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2=\tau\|\mu_{\mathrm{tail}}-\mu_{\mathrm{head}}\|_2$ with $\tau=\mathrm{TV}(P,\hat P)$, yielding a new head-tail diameter bound $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2\le\tau\,\mathrm{diam}_{H,T}$ and refinements linking the error to $\mathrm{Var}_P(V)$. Under an i.i.d. Gaussian score model $s_i\sim\mathcal N(\mu,\sigma^2)$ we derive closed-form tail masses and an asymptotic rule for the minimal $k_\varepsilon$ ensuring $\mathrm{TV}(P,\hat P)\le\varepsilon$, namely $k_\varepsilon/n\approx\Phi_c(\sigma+\Phi^{-1}(\varepsilon))$. Experiments on bert-base-uncased and synthetic logits confirm the predicted scaling of $k_\varepsilon/n$ and show that certified Top-$k$ can reduce the number of scored keys by 2-4$\times$ on average while meeting the prescribed total-variation budget.
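Two identities in this abstract, that TV equals the discarded tail mass and that the output error factorizes through the head-tail mean gap, are straightforward to verify numerically. Below is a minimal NumPy/SciPy sketch with the hypothetical names topk_attention_check and k_eps_fraction; it checks the identities on one query and evaluates the closed-form keep-fraction rule, rather than implementing the paper's certified procedure:

    import numpy as np
    from scipy.stats import norm

    def topk_attention_check(scores, V, k):
        # Verify TV(P, P_hat) = dropped tail mass and
        # ||Attn - Attn_k||_2 = TV * ||mu_tail - mu_head||_2 for one query.
        P = np.exp(scores - scores.max())
        P /= P.sum()                               # softmax over all keys
        head = np.argsort(scores)[-k:]             # indices of the k largest logits
        tail = np.setdiff1d(np.arange(len(scores)), head)
        S = P[head].sum()                          # retained head mass
        tv = 1.0 - S                               # TV equals the discarded mass
        attn = P @ V                               # full attention output
        attn_k = (P[head] / S) @ V[head]           # renormalized Top-k output (= mu_head)
        mu_tail = (P[tail] / (1.0 - S)) @ V[tail]  # tail mean of the values
        err = np.linalg.norm(attn - attn_k)
        assert np.isclose(err, tv * np.linalg.norm(mu_tail - attn_k))
        return tv, err

    def k_eps_fraction(sigma, eps):
        # Asymptotic keep-fraction k_eps / n under i.i.d. N(mu, sigma^2) scores
        # so that TV <= eps: Phi_c(sigma + Phi^{-1}(eps)), with Phi_c = norm.sf.
        return norm.sf(sigma + norm.ppf(eps))

    rng = np.random.default_rng(0)
    tv, err = topk_attention_check(rng.normal(0.0, 2.0, 512),
                                   rng.standard_normal((512, 64)), k=64)

For example, k_eps_fraction(2.0, 0.01) predicts the fraction of keys that must be kept at $\sigma=2$ to certify $\mathrm{TV}\le 0.01$; the key-reduction factor reported in the abstract is the reciprocal of this fraction.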

