second inequality
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
Shang, Shuning, Strauss, Hubert, Wei, Stanley, Arora, Sanjeev, Razin, Noam
Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement learning from human feedback (RLHF), we develop reward model evaluation metrics that account for the harmfulness of reward errors. Compared to standard ranking accuracy, these metrics typically correlate better with the performance of a language model after RLHF, yet gaps remain in robustly evaluating reward models. Second, we provide insights for reward design in settings with verifiable rewards. A key theme underlying our results is that the effectiveness of a proxy reward function depends heavily on its interaction with the initial policy and learning algorithm.
Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards Appendix AFormal Definition of Inhomogeneous Poisson Process
The inhomogeneous Poisson (point) process is a Poisson point process with a Poisson parameter set as some time-dependent function r(τ). Let N(a,b) represent the number of points of inhomogeneous Poisson process with intensity function r(t) occurring in the interval [a,b], then the probability of n points existing in the interval [a,b] is given by, P(N(a,b) = n) Λ(a,b)n n! In this paper, the points mean the conversions and the time-dependent intensity function r() is defined in Eq. (2) and it depends on the realization of the conversions and parameter θ. Suppose X1, Xn are independent, mean-zero, subexponential random variables, and a = (a1,,an) is an ndimensional constanst vector. We first introduce the main idea of the the PAMM algorithm.
Detection of local geometry in random graphs: information-theoretic and computational limits
Bok, Jinho, Li, Shuangping, Yu, Sophie H.
We study the problem of detecting local geometry in random graphs. We introduce a model $\mathcal{G}(n, p, d, k)$, where a hidden community of average size $k$ has edges drawn as a random geometric graph on $\mathbb{S}^{d-1}$, while all remaining edges follow the Erdős--Rényi model $\mathcal{G}(n, p)$. The random geometric graph is generated by thresholding inner products of latent vectors on $\mathbb{S}^{d-1}$, with each edge having marginal probability equal to $p$. This implies that $\mathcal{G}(n, p, d, k)$ and $\mathcal{G}(n, p)$ are indistinguishable at the level of the marginals, and the signal lies entirely in the edge dependencies induced by the local geometry. We investigate both the information-theoretic and computational limits of detection. On the information-theoretic side, our upper bounds follow from three tests based on signed triangle counts: a global test, a scan test, and a constrained scan test; our lower bounds follow from two complementary methods: truncated second moment via Wishart--GOE comparison, and tensorization of KL divergence. These results together settle the detection threshold at $d = \widetildeΘ(k^2 \vee k^6/n^3)$ for fixed $p$, and extend the state-of-the-art bounds from the full model (i.e., $k = n$) for vanishing $p$. On the computational side, we identify a computational--statistical gap and provide evidence via the low-degree polynomial framework, as well as the suboptimality of signed cycle counts of length $\ell \geq 4$.
Noise-Adaptive Thompson Sampling for Linear Contextual Bandits
Linear contextual bandits represent a fundamental class of models with numerous real-world applications, and it is critical to developing algorithms that can effectively manage noise with unknown variance, ensuring provable guarantees for both worst-case constant-variance noise and deterministic reward scenarios.