Qi Cai, Northwestern University, Evanston, IL 60208

Neural Information Processing Systems

Temporal-difference and Q-learning play a key role in deep reinforcement learning, where they are empowered by expressive nonlinear function approximators such as neural networks. At the core of their empirical successes is the learned feature representation, which embeds rich observations, e.g., images and texts, into a latent space that encodes semantic structures. Meanwhile, the evolution of such a feature representation is crucial to the convergence of temporal-difference and Q-learning. In particular, temporal-difference learning converges when the function approximator is linear in a feature representation that is fixed throughout learning, and possibly diverges otherwise. We aim to answer the following question: when the function approximator is a neural network, how does the associated feature representation evolve?
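As a point of reference for the fixed-representation setting mentioned above, the following is a minimal sketch of TD(0) with a linear function approximator over a fixed feature map; the feature map, toy transition kernel, reward, and step size are illustrative assumptions, not the paper's construction.

import numpy as np

# Minimal TD(0) sketch with a *fixed* linear feature representation.
# The feature map, uniform toy transitions, reward, and step size are
# illustrative assumptions, not the paper's construction.
n_states, d = 5, 3
rng = np.random.default_rng(0)
phi = rng.standard_normal((n_states, d))   # fixed feature representation phi(s)
theta = np.zeros(d)                        # value estimate: V(s) = phi(s) @ theta
gamma, alpha = 0.9, 0.05

s = 0
for _ in range(10_000):
    s_next = int(rng.integers(n_states))   # toy uniform transition kernel
    r = 1.0 if s_next == n_states - 1 else 0.0
    # Semi-gradient TD(0) update: theta <- theta + alpha * delta * phi(s)
    delta = r + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta += alpha * delta * phi[s]
    s = s_next

print("estimated values:", phi @ theta)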


The analysis of TD [21] requires an implicit local linearization with respect to the initial feature representation, which effectively keeps the feature representation fixed at its initial value throughout learning.
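To make the local linearization concrete, one standard neural-tangent-kernel-style expansion (stated here for illustration; the exact definition in [21] may differ) is
\[
\widehat{Q}(s, a; \theta) \;\approx\; \widehat{Q}(s, a; \theta_0) + \nabla_\theta \widehat{Q}(s, a; \theta_0)^\top (\theta - \theta_0),
\]
so that the approximator is linear in the feature representation $\phi_0(s, a) = \nabla_\theta \widehat{Q}(s, a; \theta_0)$ induced by the initial parameters $\theta_0$, and this representation does not evolve during learning.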


We appreciate the valuable comments from the reviewers. We study the discretization of the trajectory of the PDE in Proposition 3.1 and Appendix D, based on which we establish a discrete-time convergence rate in Corollary 4.4 by aggregating the discretization errors. We will cite the paper in our revision; thank you for pointing it out. On the other hand, we do understand that Assumptions B.1 [...]. Thus, we put Q-learning in the appendix as an extension of our main results for TD. It is worth noting that the universal approximation theorem (UAT) requires additional conditions on the target function, e.g., [...]. As UAT does not ensure the approximation of any [...], we show in Lemma C.1 that, in contrast, [...]. The proof is technical and requires certain preliminary knowledge of optimal transport, such as the Wasserstein gradient flow. We will include the following flowchart of the proof in the revision.
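For intuition on the discretization argument (a generic forward-Euler sketch under assumed smoothness, not the statement of Proposition 3.1): discretizing a continuous-time trajectory $\dot{\theta}(t) = g(\theta(t))$ with step size $\epsilon$ gives iterates
\[
\theta_{k+1} = \theta_k + \epsilon\, g(\theta_k),
\]
where each step incurs a local error of order $\epsilon^2$ when $g$ is Lipschitz; aggregating these errors over $k \le T/\epsilon$ steps yields a global deviation of order $\epsilon$, so the discrete iterates inherit the continuous-time convergence rate up to an $O(\epsilon)$ term.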


A Appendix


G. From Eq. (4), we have ϕ [...]. The proof is inspired by the universality proofs of prior symmetrization approaches [102, 74, 41]. Let ψ: X → Y be an arbitrary G-equivariant function. We leave proving this as future work. In general, we are interested in obtaining a faithful representation ρ, i.e., one such that ρ(g) is distinct for each g. We now show the following: Proposition 3. The proposed distribution p [...]. We now show the following: Proposition 4. The proposed distribution p [...]. We also note that scale(Q) gives an orthogonal matrix of determinant +1: it returns Q if det(Q) = +1; otherwise (det(Q) = −1, since Q is orthogonal) it scales the first column by −1, which flips the determinant to +1 while not affecting orthogonality.
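As an illustration of the scale(Q) operation described above, here is a minimal NumPy sketch; the function signature and the example matrix are ours, used only for illustration.

import numpy as np

def scale(Q: np.ndarray) -> np.ndarray:
    """Map an orthogonal matrix Q to one with determinant +1.

    If det(Q) = +1, return Q unchanged; otherwise det(Q) = -1 (since Q is
    orthogonal), so flipping the sign of the first column negates the
    determinant while leaving the columns orthonormal.
    """
    if np.linalg.det(Q) > 0:
        return Q
    Q = Q.copy()
    Q[:, 0] = -Q[:, 0]
    return Q

# Example: an orthogonal matrix with determinant -1 (a reflection).
Q = np.array([[0.0, 1.0],
              [1.0, 0.0]])
R = scale(Q)
print(np.linalg.det(R))   # ~ +1.0
print(R @ R.T)            # ~ identity, so R remains orthogonal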