Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning
