A Proofs

Throughout this section, we use p(s_t = s, a_t = a) to denote the probability that the state-action pair at time step t equals (s, a), and we denote the probability of a trajectory by p(τ) = p(s, a
