Accelerating Nash Learning from Human Feedback via Mirror Prox
