Why is Your Language Model a Poor Implicit Reward Model?
