Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model

Faisal, Fahim, Song, Kaiqiang, Wang, Song, Ma, Simin, Liu, Shujian, Deng, Haoyun, Indurthi, Sathish Reddy

Oct-1-2025–arXiv.org Artificial Intelligence

While reinforcement learning has advanced the reasoning abilities of Large Language Models (LLMs), these gains are largely confined to English, creating a significant performance disparity across languages. To address this, we introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards (PB-RLSVR), a novel framework that enhances multilingual reasoning while circumventing the need for human-annotated data in target languages. Our approach employs a high-performing English LLM as a "pivot" model to generate reference responses for reasoning tasks. A multilingual model is then rewarded based on the semantic equivalence of its responses to the English reference, effectively transferring the pivot model's reasoning capabilities across languages. We investigate several cross-lingual semantic reward functions, including those based on embeddings and machine translation. Extensive experiments on a suite of multilingual reasoning benchmarks show that our method significantly narrows the performance gap between English and other languages, substantially outperforming traditional PPO baselines. Specifically, our PB-RLSVR framework improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, demonstrating a powerful and data-efficient approach to building truly multilingual reasoning agents.

The reasoning capabilities of Large Language Models (LLMs) have advanced dramatically, driven by sophisticated training paradigms such as Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) and by innovations in policy optimization algorithms such as Proximal Policy Optimization (PPO) (Schulman et al., 2017a), REINFORCE++ (Hu et al., 2025), and Group Relative Policy Optimization (GRPO) (Shao et al., 2024).
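The embedding-based variant of the semantic reward mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the reward formula, so everything here is an assumption made for illustration: the reward is taken to be the cosine similarity between an embedding of the model's target-language response and an embedding of the English pivot reference, clipped to [0, 1]. The function names `cosine_similarity` and `semantic_reward` are hypothetical, and the embeddings are assumed to come from some multilingual sentence encoder that is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("embedding dimensions must match")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_reward(response_emb: Sequence[float],
                    reference_emb: Sequence[float]) -> float:
    """Hypothetical reward: cosine similarity clipped to [0, 1].

    Assumption: both embeddings come from the same multilingual encoder,
    so a target-language response and the English pivot reference are
    comparable in one vector space. The clipping is an illustrative choice,
    not something stated in the source.
    """
    sim = cosine_similarity(response_emb, reference_emb)
    return max(0.0, min(1.0, sim))


if __name__ == "__main__":
    # Toy vectors standing in for real sentence embeddings.
    reference = [0.2, 0.7, 0.1]
    close_response = [0.25, 0.65, 0.05]
    unrelated_response = [0.9, -0.3, 0.0]
    print(semantic_reward(close_response, reference))
    print(semantic_reward(unrelated_response, reference))
```

In an RL loop, a scalar like this would stand in for the human-annotated or rule-based reward signal. The machine-translation-based rewards the abstract also mentions would need a translation step before comparison and are not covered by this sketch.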

large language model, machine learning, natural language

Country:
- North America > United States (0.46)
- Europe > Austria (0.28)
- Asia (0.28)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
