Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization
