contextprm
ContextPRM: Leveraging Contextual Coherence for multi-domain Test-Time Scaling
Zhang, Haotian, Liu, Liu, Yu, Baosheng, Qiu, Jiayan, Xiao, Likang, Ren, Yanwei, Chen, Quan, Liu, Xianglong
Process reward models (PRMs) have demonstrated significant efficacy in enhancing the mathematical reasoning capabilities of large language models (LLMs) by leveraging test-time scaling (TTS). However, while most PRMs exhibit substantial gains in mathematical domains, the scarcity of domain-specific training data and knowledge-based learning patterns limits their generalization ability when faced with other domains. T o address this limitation, we shift the learning objective from verifying domain-specific knowledge to modeling domain-agnostic logical flow. Centering on contextual coherence between chain-of-thought (CoT) steps, our approach is realized through a novel data annotation and training framework, which enhances the model's generalization capabilities across diverse domains. F or instance, our resulting model, ContextPRM, achieves a notable 6.5% average accuracy improvement over the majority voting baseline via weighted majority voting across nine non-mathematical domains in MMLU-Pro, including law, history, and philosophy, significantly surpassing the 2.2% improvement from V ersaPRM and 0.5% gains from other mathematics-focused PRMs, demonstrating consistent performance across both mathematical and non-mathematical domains.