A Proofs

Neural Information Processing Systems 

To prove the main results, we need the following Lemma and Propositions 1 and 2.

A.4 Time complexity of gradient calculation in ML-CPC

Suppose g is a neural network parametrized by θ. The gradient of the ML-CPC objective is
\[
\nabla_\theta \mathcal{L} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta g_\theta(x_i, y_i) - \sum_{j=1}^{n}\sum_{k=1}^{m} \frac{e^{g_\theta(x_j, y_k)}}{\sum_{j'=1}^{n}\sum_{k'=1}^{m} e^{g_\theta(x_{j'}, y_{k'})}} \nabla_\theta g_\theta(x_j, y_k),
\]
which requires evaluating the gradient of g at all nm pairs. So the time complexity to compute the ML-CPC gradient is O(nm). We include a PyTorch implementation of α-ML-CPC as follows. Alternatively, one can use kl_div() to ensure that the loss is non-negative.

C.2 Mutual information estimation

The general procedure follows that in [40] and [44]. We consider two types of architectures: joint and separable.
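As a sketch of the ML-CPC objective discussed in A.4 (written in NumPy rather than PyTorch for self-containment; the layout assumption that each row's positive score sits in column 0 of the n×m score matrix is ours, and the α-reweighting of the full implementation is omitted):

```python
import numpy as np

def ml_cpc_loss(scores):
    """Negative ML-CPC objective for an (n, m) matrix of critic scores.

    Assumes column 0 of each row holds the positive pair's score. All
    n*m scores share a single normalizing constant (the "multi-label"
    normalization), unlike CPC's per-row normalization.
    """
    n, m = scores.shape
    # numerically stable log-sum-exp over all n*m scores
    s_max = scores.max()
    log_z = s_max + np.log(np.exp(scores - s_max).sum())
    pos = scores[:, 0]
    # objective: (1/n) * sum_i [g(x_i, y_i) + log n - log Z]; negate to get a loss
    return -np.mean(pos + np.log(n) - log_z)
```

With uninformative scores (all zeros), the loss evaluates to log m, since log Z = log(nm) and the positive terms vanish.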
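The joint and separable critics mentioned in C.2 differ in how g(x, y) is computed; a minimal NumPy sketch (the layer sizes, tanh nonlinearity, and concatenation-based joint critic are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def separable_critic(X, Y, Wx, Wy):
    """Separable critic: g(x, y) = f(x)^T h(y).
    All n*m scores come from one matrix product, so only
    n + m network forward passes are needed."""
    return (X @ Wx) @ (Y @ Wy).T  # (n, m) score matrix

def joint_critic(X, Y, W, v):
    """Joint critic: g(x, y) = MLP([x; y]), evaluated on every
    pair, so it needs n*m forward passes."""
    n, m = X.shape[0], Y.shape[0]
    scores = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            h = np.tanh(np.concatenate([X[i], Y[j]]) @ W)
            scores[i, j] = h @ v
    return scores

# toy shapes: n=4 x-samples, m=6 y-samples, d=5 features, k=3 embedding dims
X, Y = rng.normal(size=(4, 5)), rng.normal(size=(6, 5))
sep = separable_critic(X, Y, rng.normal(size=(5, 3)), rng.normal(size=(5, 3)))
jnt = joint_critic(X, Y, rng.normal(size=(10, 3)), rng.normal(size=3))
```

Either critic yields an (n, m) score matrix that can be fed to the same estimator; the separable form trades expressiveness for the cheaper n + m forward passes.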