StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization
Wang, Ziliang, Zheng, Xuhui, An, Kang, Ouyang, Cijun, Cai, Jialu, Wang, Yuhang, Wu, Yichao
arXiv.org Artificial Intelligence
Efficient multi-hop reasoning requires agents based on Large Language Models (LLMs) to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but these methods underperform on complex, multi-hop QA because their rewards are sparse and come from a global signal only. To address this gap, we introduce StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. It combines richer, more detailed intermediate search rewards with token-level process supervision based on information gain and redundancy penalties to better guide each search step. We constructed a fine-grained question-answering dataset containing sub-question-level search trajectories, derived from open-source datasets through a data pipeline. On standard multi-hop QA benchmarks, StepSearch significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models, respectively, over various search-with-RL baselines while using only 19k training samples, demonstrating the effectiveness of fine-grained, step-wise supervision in optimizing deep search LLMs. Our code will be released at https://github.com/Zillwang/StepSearch.
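To make the idea of a step-level reward concrete, the following is a minimal sketch of how an information-gain term and a redundancy penalty could be combined for a single search step. It is not the paper's formulation: the function names `overlap` and `step_reward`, the Jaccard token overlap used as a similarity measure, and the equal weighting of gain and penalty are all assumptions chosen for illustration.

```python
from typing import List, Set, Tuple


def overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def step_reward(
    retrieved: List[str],
    gold: List[str],
    seen: Set[str],
    best_so_far: float,
) -> Tuple[float, float]:
    """Return (reward, updated best coverage) for one search step.

    Hypothetical scoring, not the paper's formula:
      gain    = improvement in the best match against any gold document
      penalty = fraction of retrieved documents already seen in earlier steps
    Note: `seen` is updated in place with the newly retrieved documents.
    """
    coverage = max((overlap(d, g) for d in retrieved for g in gold), default=0.0)
    gain = max(0.0, coverage - best_so_far)
    penalty = sum(d in seen for d in retrieved) / len(retrieved) if retrieved else 0.0
    seen.update(retrieved)
    return gain - penalty, max(best_so_far, coverage)


# Usage: the first search finds a gold document; repeating it is penalized.
gold_docs = ["paris is the capital of france"]
history: Set[str] = set()
r1, best = step_reward(["paris is the capital of france"], gold_docs, history, 0.0)
r2, best = step_reward(["paris is the capital of france"], gold_docs, history, best)
print(r1, r2)
```

Under these assumptions, a step that surfaces new relevant evidence earns a positive reward, while a step that re-retrieves documents from earlier steps earns no gain and is penalized, which is the kind of per-step signal that a global answer-only reward cannot provide.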
May-27-2025