Delightful Policy Gradient

Osband, Ian

arXiv.org Machine Learning

Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g. one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the Delightful Policy Gradient (DG), which gates each term with a sigmoid of "delight", the product of advantage and action surprisal (negative log-probability). For K-armed bandits, DG provably improves directional accuracy in a single context and, across multiple contexts, shifts the expected gradient strictly closer to the supervised cross-entropy oracle. This second effect is not variance reduction: it persists even with infinite samples. Empirically, DG outperforms REINFORCE, PPO, and advantage-weighted baselines across MNIST, transformer sequence modeling, and continuous control, with larger gains on harder tasks.
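
Based only on the abstract's description, a minimal sketch of the gated update might look like the following. The function name, the stop-gradient through the gate, and the exact placement of the gate are assumptions, not details taken from the paper.

```python
import torch

def delightful_policy_gradient_loss(log_probs, advantages):
    """Sketch of the gated policy-gradient objective from the abstract.

    log_probs:  log pi(a|s) for each sampled action, shape (batch,)
    advantages: advantage estimates, shape (batch,)

    "Delight" is the product of advantage and action surprisal
    (negative log-probability); each REINFORCE term is gated by a
    sigmoid of this quantity. Blocking gradient flow through the
    gate is an assumption, not confirmed by the source.
    """
    surprisal = -log_probs.detach()       # -log pi(a|s); no grad through the gate
    delight = advantages * surprisal      # advantage x surprisal
    gate = torch.sigmoid(delight)         # per-sample gate in (0, 1)
    # standard REINFORCE term, down-weighted per sample by the gate
    return -(gate * advantages * log_probs).mean()

# Toy usage with random data (hypothetical shapes):
logits = torch.randn(8, 4, requires_grad=True)
log_pi = torch.log_softmax(logits, dim=-1)
actions = torch.randint(0, 4, (8,))
lp = log_pi[torch.arange(8), actions]
adv = torch.randn(8)
print(delightful_policy_gradient_loss(lp, adv))
```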


ea89621bee7c88b2c5be6681c8ef4906-AuthorFeedback.pdf

Neural Information Processing Systems

In contrast, we use 10% of the training set for validation, and treat the validation set as a purely held-out test set (this also means that we train on less data). We will explain this more clearly. [...] both spheres are sufficiently tiny (i.e.
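
As an illustration of the split protocol described in this excerpt (not code from the rebuttal; all variable names and data here are hypothetical), one might carve the validation set out of the training data like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the actual training set.
X_train = np.random.rand(1000, 32)
y_train = np.random.randint(0, 10, size=1000)

# Carve 10% off the training set for validation, so the original
# validation set can be treated as a purely held-out test set
# (at the cost of training on slightly less data).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.10, random_state=0
)
```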


51cdbd2611e844ece5d80878eb770436-AuthorFeedback.pdf

Neural Information Processing Systems

Optimal Transport (OT) + Fairness (R2, R4): Let us highlight two key differences between "Wasserstein Fair Classification" (Jiang et al.) and our work. Generally, group fairness constraints aim to reflect a certain independence between the prediction and the sensitive attribute.
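
As an illustration of what such an independence constraint can look like (a standard example, demographic parity, not one stated in this excerpt): the prediction $\hat{Y}$ must be statistically independent of the sensitive attribute $A$,

$$\hat{Y} \perp A \quad\Longleftrightarrow\quad P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1) \quad \text{for all } a.$$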