Gong, Hailei
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Wang, Teng, Jiang, Zhangyi, He, Zhenqi, Yang, Wenhan, Zheng, Yanan, Li, Zeyu, He, Zifan, Tong, Shenyang, Gong, Hailei
Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable for identifying the best intermediate step. In this paper, we propose a novel reward model, the Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at fine-grained and coarse-grained levels. HRM excels at evaluating multi-step reasoning coherence and self-reflection, especially in scenarios where a previous reasoning step is incorrect but a subsequent step successfully identifies and corrects the error. Furthermore, to address the inefficiency of autonomously annotating PRM training data via Monte Carlo Tree Search (MCTS), we introduce a lightweight and effective data augmentation strategy, Hierarchical Node Compression (HNC), based on node merging (combining two consecutive reasoning steps into one) in the tree structure. By applying HNC to MCTS-generated reasoning trajectories, we enhance the diversity and robustness of HRM training data while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset demonstrate that HRM, in conjunction with HNC, achieves superior stability and reliability in evaluation compared to PRM. Moreover, cross-domain evaluations on the MATH500 and GSM8K datasets confirm HRM's superior generalization and robustness across diverse reasoning tasks.
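To make the node-merging idea behind HNC concrete, the following is a minimal sketch of compressing consecutive reasoning steps in trajectories extracted from an MCTS-style reasoning tree. The StepNode class, the merge probability, and the rule that a merged step inherits the later step's label are illustrative assumptions, not the paper's exact procedure.

from dataclasses import dataclass
from typing import List
import random

@dataclass
class StepNode:
    text: str      # one reasoning step
    score: float   # step-level label (e.g., MC-estimated correctness)

def merge_steps(first: StepNode, second: StepNode) -> StepNode:
    """Combine two consecutive reasoning steps into a single coarser step."""
    # Assumption: the merged step keeps the later step's label, since it
    # reflects the state after both steps have been taken.
    return StepNode(text=first.text + "\n" + second.text, score=second.score)

def augment_trajectory(steps: List[StepNode], merge_prob: float = 0.3) -> List[StepNode]:
    """Produce a coarser-grained trajectory by occasionally merging adjacent steps."""
    out: List[StepNode] = []
    i = 0
    while i < len(steps):
        if i + 1 < len(steps) and random.random() < merge_prob:
            out.append(merge_steps(steps[i], steps[i + 1]))  # controlled coarsening
            i += 2
        else:
            out.append(steps[i])
            i += 1
    return out

Applied to many MCTS-generated trajectories, this kind of compression yields additional coarse-grained training samples at negligible cost, which is the role HNC plays for HRM training.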
Decision Information Meets Large Language Models: The Future of Explainable Operations Research
Zhang, Yansen, Kang, Qingcan, Yu, Wing Yin, Gong, Hailei, Fu, Xiaojin, Han, Xiongwei, Zhong, Tao, Ma, Chen
Operations Research (OR) is vital for decision-making in many industries. While recent OR methods have seen significant improvements in automation and efficiency through the integration of Large Language Models (LLMs), they still struggle to produce meaningful explanations. This lack of clarity raises concerns about transparency and trustworthiness in OR applications. To address these challenges, we propose a comprehensive framework, Explainable Operations Research (EOR), emphasizing actionable and understandable explanations that accompany optimization. The core of EOR is the concept of Decision Information, which emerges from what-if analysis and focuses on evaluating the impact of changes to complex constraints (or parameters) on decision-making. Specifically, we utilize bipartite graphs to quantify the changes in the OR model and adopt LLMs to improve the explanation capabilities. Additionally, we introduce the first industrial benchmark to rigorously evaluate the effectiveness of explanations and analyses in OR, establishing a new standard for transparency and clarity in the field.
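As a rough illustration of the bipartite-graph view, the sketch below places constraints on one side and decision variables on the other, with an edge whenever a variable appears in a constraint; a what-if change is then quantified by the edges it adds or removes. The edge-difference metric and the example model are assumptions for illustration, not the paper's exact Decision Information measure.

from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (constraint_name, variable_name)

def bipartite_edges(model: Dict[str, Set[str]]) -> Set[Edge]:
    """model maps each constraint name to the set of variables it involves."""
    return {(c, v) for c, variables in model.items() for v in variables}

def constraint_change(before: Dict[str, Set[str]], after: Dict[str, Set[str]]) -> Set[Edge]:
    """Edges added or removed when constraints (or parameters) are modified."""
    return bipartite_edges(before) ^ bipartite_edges(after)

# Hypothetical what-if scenario: a capacity constraint is tightened so that it
# now also involves variable "x3".
base = {"capacity": {"x1", "x2"}, "demand": {"x2", "x3"}}
what_if = {"capacity": {"x1", "x2", "x3"}, "demand": {"x2", "x3"}}
changed = constraint_change(base, what_if)  # {("capacity", "x3")}

The changed edges localize which decisions are affected by the modification; an LLM can then be prompted to explain the impact of exactly those changes rather than the whole model.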
BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving
Wang, Teng, Yu, Wing-Yin, He, Zhenqi, Liu, Zehua, Han, Xiongwei, Gong, Hailei, Wu, Han, Shi, Wei, She, Ruifeng, Zhu, Fangzhou, Zhong, Tao
LLMs exhibit advanced reasoning capabilities, offering the potential to transform natural language questions into mathematical models. However, existing open-source datasets in the operations research domain lack detailed annotations of the modeling process, such as variable definitions, and focus solely on objective values, which hinders reinforcement learning applications. To address this, we release the StructuredOR dataset, annotated with comprehensive labels that capture the complete mathematical modeling process. We further propose BPP-Search, an algorithm that integrates reinforcement learning into a tree-of-thought structure using Beam search, a Process reward model, and a pairwise Preference algorithm. This approach enables efficient exploration of tree structures, avoiding exhaustive search while improving accuracy. Extensive experiments on the StructuredOR, NL4OPT, and MAMO-ComplexLP datasets show that BPP-Search significantly outperforms state-of-the-art methods. In tree-based reasoning, BPP-Search excels in accuracy and efficiency, enabling faster retrieval of correct solutions.
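For intuition, the following is a minimal sketch of beam search over a tree of thought guided by a process reward model, in the spirit of BPP-Search. The expand(), prm_score(), and is_complete() callables are placeholders (assumptions), and the pairwise preference re-ranking used in the paper is omitted here.

from typing import Callable, List, Tuple

Partial = List[str]  # a partial modeling trajectory: a list of reasoning steps

def beam_search(
    root: Partial,
    expand: Callable[[Partial], List[str]],    # propose candidate next steps
    prm_score: Callable[[Partial], float],     # process reward for a partial trajectory
    is_complete: Callable[[Partial], bool],
    beam_width: int = 4,
    max_depth: int = 10,
) -> Partial:
    beam: List[Tuple[float, Partial]] = [(0.0, root)]
    for _ in range(max_depth):
        candidates: List[Tuple[float, Partial]] = []
        for _, partial in beam:
            if is_complete(partial):
                candidates.append((prm_score(partial), partial))
                continue
            for step in expand(partial):
                new_partial = partial + [step]
                candidates.append((prm_score(new_partial), new_partial))
        # Keep only the top-scoring partial trajectories, avoiding exhaustive search.
        beam = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]
        if all(is_complete(p) for _, p in beam):
            break
    return max(beam, key=lambda x: x[0])[1]

Pruning to a fixed beam width at each depth is what keeps exploration tractable while the process reward model steers the search toward promising modeling steps.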