

Softmax operator


Appendix for Softmax Deep Double Deterministic Policy Gradients

Pan, Ling

Neural Information Processing Systems

We demonstrate the smoothing effect of SD3 on the optimization landscape in this section; the experimental setup is the same as in Section 4.1 of the main text for the comparative study of SD2. Experimental details can be found in Section B.2. The performance comparison of SD3 and TD3 is shown in Figure 1(a), where SD3 significantly outperforms TD3, demonstrating the smoothing effect of SD3 over TD3. Hyperparameters of DDPG and SD2 are summarized in Table 1. We assume that the actor is a local maximizer with respect to the critic.
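The Boltzmann softmax operator behind SD2 and SD3 replaces the hard max over Q-values with an exp-weighted average, which is what produces the smoothing. A minimal NumPy sketch, using discrete actions for illustration (the papers' continuous-action version estimates the corresponding integral by sampling actions):

```python
import numpy as np

def boltzmann_softmax(q_values, beta):
    """Boltzmann softmax operator: exp(beta*Q)-weighted average of Q-values.
    As beta -> inf it approaches max(Q); as beta -> 0 it approaches mean(Q)."""
    w = np.exp(beta * (q_values - np.max(q_values)))  # shift for numerical stability
    return float(np.sum(w * q_values) / np.sum(w))

q = np.array([1.0, 2.0, 3.0])
print(boltzmann_softmax(q, 0.0))    # mean of Q: 2.0
print(boltzmann_softmax(q, 100.0))  # close to max(Q): ~3.0
```

The interpolation between mean and max, controlled by beta, is what lets the operator trade off the overestimation of max against the pessimism of an average.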


884d247c6f65a96a7da4d1105d584ddd-Paper.pdf

Neural Information Processing Systems

DDPG [24] extends Q-learning to continuous control based on the Deterministic Policy Gradient [31] algorithm, which learns a deterministic policy π(s; φ), parameterized by φ, to maximize the Q-function and thereby approximate the max operator.
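The deterministic policy gradient idea can be illustrated with a toy analytic Q-function: ascend Q(s, π(s; φ)) in φ via the chain rule, dQ/dφ = (∂Q/∂a)(∂π/∂φ). The quadratic Q and linear policy below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy deterministic policy gradient: Q is known analytically here,
# whereas DDPG learns it with a critic network.
def q(s, a):
    return -(a - 2.0 * s) ** 2        # optimal action is a* = 2s

def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)

rng = np.random.default_rng(0)
phi = 0.0                             # linear policy pi(s; phi) = phi * s
for _ in range(200):
    s = rng.uniform(0.5, 1.5)         # sampled state
    a = phi * s
    phi += 0.1 * dq_da(s, a) * s      # chain rule: dQ/dphi = dQ/da * da/dphi

print(phi)  # converges near 2.0, the maximizing policy parameter
```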




Softmax Deep Double Deterministic Policy Gradients

Neural Information Processing Systems

A widely-used actor-critic reinforcement learning algorithm for continuous control, Deep Deterministic Policy Gradients (DDPG), suffers from the overestimation problem, which can negatively affect performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can lead to a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous control. We first theoretically analyze the softmax operator in continuous action space. Then, we uncover an important property of the softmax operator in actor-critic algorithms, i.e., it helps to smooth the optimization landscape, which sheds new light on the benefits of the operator. We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), by building the softmax operator upon single and double estimators, which effectively mitigate the overestimation and underestimation biases, respectively. We conduct extensive experiments on challenging continuous control tasks, and results show that SD3 outperforms state-of-the-art methods.
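A simplified sketch of how a softmax value target can be built upon double estimators, in the spirit of SD3. The Gaussian action sampler, β, sample count, and min-combination below are illustrative simplifications, not the paper's exact update rule:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_value(q_fn, actions, beta):
    """Estimate the softmax value over sampled actions (exp-weighted average)."""
    q = q_fn(actions)
    w = np.exp(beta * (q - q.max()))  # shift for numerical stability
    return float(np.sum(w * q) / np.sum(w))

def sd3_style_target(reward, gamma, q1, q2, mu, beta=1.0, k=50, noise=0.2):
    """Illustrative SD3-style target: softmax values from two critics,
    combined conservatively via min to curb overestimation."""
    actions = mu + noise * rng.standard_normal(k)  # sample around target policy
    v1 = softmax_value(q1, actions, beta)
    v2 = softmax_value(q2, actions, beta)
    return reward + gamma * min(v1, v2)

# Usage with two placeholder critics peaked at a = 1:
target = sd3_style_target(1.0, 0.99,
                          q1=lambda a: -(a - 1.0) ** 2,
                          q2=lambda a: -(a - 1.0) ** 2 - 0.5,
                          mu=1.0)
print(target)
```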


Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms

Nair, Pravin

arXiv.org Artificial Intelligence

The softmax function is a basic operator in machine learning and optimization, used in classification, attention mechanisms, reinforcement learning, game theory, and problems involving log-sum-exp terms. Existing robustness guarantees of learning models and convergence analysis of optimization algorithms typically consider the softmax operator to have a Lipschitz constant of $1$ with respect to the $\ell_2$ norm. In this work, we prove that the softmax function is contractive with the Lipschitz constant $1/2$, uniformly across all $\ell_p$ norms with $p \ge 1$. We also show that the local Lipschitz constant of softmax attains $1/2$ for $p = 1$ and $p = \infty$, and for $p \in (1,\infty)$, the constant remains strictly below $1/2$ and the supremum $1/2$ is achieved only in the limit. To our knowledge, this is the first comprehensive norm-uniform analysis of softmax Lipschitz continuity. We demonstrate how the sharper constant directly improves a range of existing theoretical results on robustness and convergence. We further validate the sharpness of the $1/2$ Lipschitz constant of the softmax operator through empirical studies on attention-based architectures (ViT, GPT-2, Qwen3-8B) and on stochastic policies in reinforcement learning.
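The 1/2 bound can be probed numerically: the softmax Jacobian is J = diag(p) − p pᵀ, and its spectral norm bounds the local ℓ2 Lipschitz constant. A quick NumPy check at random points (an empirical sanity check, not a proof):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)  # J_ij = p_i * (delta_ij - p_j)

rng = np.random.default_rng(0)
# Spectral norm (largest singular value) of the Jacobian at random logits.
worst = max(np.linalg.norm(softmax_jacobian(rng.normal(size=8)), 2)
            for _ in range(1000))
print(worst)  # never exceeds 0.5, consistent with the 1/2 Lipschitz constant
```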