A Implementation of the PS-CD Algorithm
Neural Information Processing Systems
Mathematically, these two implementations are equivalent; we present both to illustrate the difference between their derivations.

B.1 Proof for Theorem 2

In this section, we provide two different ways to prove Theorem 2. The first one is more straightforward and directly differentiates through the logarithm term. To circumvent the difficulty this raises, we introduce a variational representation (Lemma 1), which follows from Jensen's inequality. Furthermore, since convergence in probability is preserved under addition, the gradient estimator in Equation (24) converges in probability to the true gradient in Equation (23).

As introduced in Equation (9) in Section 2.3, the divergence corresponding to the γ-scoring rule is the γ-divergence:

D_γ(p ‖ q) = (1 / (γ(1+γ))) log ∫ p(x)^{1+γ} dx − (1/γ) log ∫ p(x) q(x)^γ dx + (1 / (1+γ)) log ∫ q(x)^{1+γ} dx.

The above lemma implies that KL divergence minimization (maximum likelihood estimation) is the special case of γ-divergence minimization obtained in the limit γ → 0, which also implies the following corollary:

Corollary 1. This is a direct consequence of Lemma 2. It can also be verified by checking the PS-CD gradient in Equation (18) in the limit γ → 0. Lemma 3 covers the case −1 < γ < 0.

In this section, we provide a theoretical analysis of the sample complexity of the gradient estimator, as well as the convergence of stochastic gradient descent with consistent (but biased given finite samples) gradient estimators, as presented in Algorithm 1.

C.1 Sample Complexity

We start by analyzing the sample complexity of the consistent gradient estimator, that is, how fast it approaches the true gradient, or equivalently, how many samples are needed to estimate the gradient to a given accuracy with high probability.
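The γ → 0 limit discussed above can be checked numerically. The sketch below assumes the standard γ-divergence of Fujisawa and Eguchi (the names `gamma_divergence` and `kl` are ours, and the parameterization may differ from the paper's Equation (9)); for discrete distributions, it should approach the KL divergence as γ shrinks.

```python
import numpy as np

def gamma_divergence(p, q, gamma):
    # Standard gamma-divergence between discrete distributions p and q
    # (assumed form; may differ in parameterization from Equation (9)).
    t1 = np.log(np.sum(p ** (1 + gamma))) / (gamma * (1 + gamma))
    t2 = -np.log(np.sum(p * q ** gamma)) / gamma
    t3 = np.log(np.sum(q ** (1 + gamma))) / (1 + gamma)
    return t1 + t2 + t3

def kl(p, q):
    # KL divergence between discrete distributions.
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
for g in [0.5, 0.1, 1e-3, 1e-5]:
    # As gamma -> 0 the gamma-divergence approaches KL(p || q).
    print(g, gamma_divergence(p, q, g), kl(p, q))
```

By Hölder's inequality the γ-divergence is nonnegative for γ > 0, and a first-order expansion in γ recovers E_p[log p − log q], i.e. the KL divergence, consistent with Lemma 2.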
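The phrase "consistent but biased given finite samples" can be illustrated with a toy plug-in estimator (not the paper's gradient estimator): log((1/n) Σ X_i) as an estimate of log E[X]. By Jensen's inequality it is biased downward for finite n, yet consistent, with bias shrinking as O(1/n).

```python
import numpy as np

rng = np.random.default_rng(0)
true_val = 0.5  # log E[exp(Z)] = 1/2 for Z ~ N(0, 1)

def estimate(n, trials=2000):
    # Average bias of the plug-in estimator log(sample mean) over many trials.
    z = rng.standard_normal((trials, n))
    est = np.log(np.exp(z).mean(axis=1))  # biased (Jensen), but consistent
    return est.mean() - true_val

for n in [10, 100, 10_000]:
    print(n, estimate(n))  # bias shrinks as n grows
```

The per-trial fluctuation also decays as O(1/√n), which is the scaling the sample-complexity analysis in this section quantifies for the actual gradient estimator.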