Collaborating Authors

 Kulikov, Ilia


SPICE: Self-Play In Corpus Environments Improves Reasoning

Liu, Bo, Jin, Chuanyang, Kim, Seungone, Yuan, Weizhe, Zhao, Wenting, Kulikov, Ilia, Li, Xian, Sukhbaatar, Sainbayar, Lanchantin, Jack, Weston, Jason

arXiv.org Artificial Intelligence

Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.
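The Challenger/Reasoner dynamic described above can be illustrated with a toy self-play round. This is a minimal sketch, not the paper's implementation: the cloze task generator, the random stand-in for the Reasoner, and the frontier-seeking reward shape are all assumptions made for illustration.

```python
import random

def challenger_make_task(document, rng):
    # Hypothetical grounded task: hide one word of the document, ask for it.
    words = document.split()
    idx = rng.randrange(len(words))
    answer = words[idx]
    question = " ".join(w if i != idx else "____" for i, w in enumerate(words))
    return question, answer

def reasoner_solve(question, rng):
    # Stand-in for the Reasoner policy; a real system samples from the LLM.
    return rng.choice(question.split())

def self_play_round(corpus, rng, n_attempts=8):
    """One SPICE-style round: the Challenger is rewarded for tasks at the
    frontier of the Reasoner's ability, neither trivial nor impossible."""
    doc = rng.choice(corpus)
    question, answer = challenger_make_task(doc, rng)
    solved = sum(reasoner_solve(question, rng) == answer for _ in range(n_attempts))
    pass_rate = solved / n_attempts
    # Frontier-seeking reward: maximal when the pass rate is near 0.5.
    challenger_reward = 1.0 - abs(pass_rate - 0.5) * 2.0
    return pass_rate, challenger_reward

rng = random.Random(0)
corpus = ["the cat sat on the mat",
          "reinforcement learning optimizes expected reward"]
pass_rate, reward = self_play_round(corpus, rng)
assert 0.0 <= pass_rate <= 1.0 and 0.0 <= reward <= 1.0
```

The key structural point the sketch captures is that both roles are driven by the same corpus document, so the external signal comes from the data rather than from the model alone.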


Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Tao, Leitian, Kulikov, Ilia, Saha, Swarnadeep, Wang, Tianlu, Xu, Jing, Li, Sharon, Weston, Jason E, Yu, Ping

arXiv.org Artificial Intelligence

Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
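The two mechanisms named in the abstract, stratified normalization and variance-aware weighting, can be sketched concretely. The band boundaries, the min-max normalization, and the Bernoulli-variance weighting below are assumed forms for illustration, not the paper's exact equations.

```python
def hero_rewards(verifier, rm_scores):
    """Stratified normalization sketch: reward-model scores are normalized
    *within* each verifier-defined group (correct vs. incorrect), then mapped
    into non-overlapping bands so verifier correctness is always preserved."""
    def normalize(group):
        lo, hi = min(group), max(group)
        span = hi - lo
        return [(s - lo) / span if span else 0.5 for s in group]

    correct = [s for v, s in zip(verifier, rm_scores) if v == 1]
    wrong = [s for v, s in zip(verifier, rm_scores) if v == 0]
    norm_c, norm_w = iter(normalize(correct)), iter(normalize(wrong))
    # Correct answers land in [0.5, 1.0], incorrect in [0.0, 0.4]:
    # a correct answer never scores below an incorrect one.
    return [0.5 + 0.5 * next(norm_c) if v == 1 else 0.4 * next(norm_w)
            for v in verifier]

def prompt_weight(verifier, w_min=0.2, w_max=1.0):
    """Variance-aware weighting sketch: emphasize prompts whose rollouts
    disagree, where dense reward-model signals matter most."""
    p = sum(verifier) / len(verifier)
    variance = p * (1 - p)          # Bernoulli variance, max 0.25 at p = 0.5
    return w_min + (w_max - w_min) * (variance / 0.25)

rewards = hero_rewards([1, 0, 1, 0], [0.9, 0.8, 0.6, 0.1])
assert min(r for r, v in zip(rewards, [1, 0, 1, 0]) if v == 1) >= 0.5
```

The design choice the sketch makes explicit: the reward model only refines *within-group* quality distinctions, so it can never flip the verifier's correctness judgment.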


J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

Whitehouse, Chenxi, Wang, Tianlu, Yu, Ping, Li, Xian, Weston, Jason, Kulikov, Ilia, Saha, Swarnadeep

arXiv.org Artificial Intelligence

The progress of AI is bottlenecked by the quality of evaluation, making powerful LLM-as-a-Judge models a core solution. The efficacy of these judges depends on their chain-of-thought reasoning, creating a critical need for methods that can effectively optimize this reasoning process. In this work, we introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards, enabling direct optimization of evaluation quality while mitigating positional bias. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance across multiple benchmarks. In particular, J1-Qwen-32B, our multitasked pointwise and pairwise judge also outperforms o1-mini, o3, and a much larger 671B DeepSeek-R1 on some benchmarks, while only training on synthetic data. Through comprehensive ablations of pairwise, pointwise, and multitask J1 variants, we demonstrate the effectiveness of our approach across seed prompts, reward strategies, and training recipes. Qualitative analysis reveals that J1 develops systematic evaluation strategies, including dynamic criteria generation, reference answer creation, iterative self-correction of initial assessments, and feedback generation for low-quality responses.
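One way to read "verifiable rewards while mitigating positional bias" is a consistency check over both response orderings; the sketch below illustrates that idea with a toy judge. The reward shape and the judge interface are assumptions, not the paper's exact recipe.

```python
def pairwise_reward(judge, prompt, resp_a, resp_b, gold):
    """Position-consistent verifiable reward (assumed form): the judge is
    queried with both response orders and is rewarded only when it selects
    the gold response under both orderings."""
    verdict_ab = judge(prompt, resp_a, resp_b)   # returns "first" or "second"
    verdict_ba = judge(prompt, resp_b, resp_a)
    picked_ab = resp_a if verdict_ab == "first" else resp_b
    picked_ba = resp_b if verdict_ba == "first" else resp_a
    return 1.0 if picked_ab == gold and picked_ba == gold else 0.0

# Toy judge that prefers the longer response, regardless of position.
length_judge = lambda p, x, y: "first" if len(x) >= len(y) else "second"
assert pairwise_reward(length_judge, "q", "short",
                       "a longer answer", "a longer answer") == 1.0

# A position-biased judge (always answers "first") earns zero reward.
first_judge = lambda p, x, y: "first"
assert pairwise_reward(first_judge, "q", "a", "b", "a") == 0.0
```

Because the reward is zero whenever the two orderings disagree, a judge trained against it is pushed toward order-invariant decisions.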


OptimalThinkingBench: Evaluating Over and Underthinking in LLMs

Aggarwal, Pranjal, Kim, Seungone, Lanchantin, Jack, Welleck, Sean, Weston, Jason, Kulikov, Ilia, Saha, Swarnadeep

arXiv.org Artificial Intelligence

Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. We introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple math and general queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks along with harder math problems. Using novel thinking-adjusted accuracy metrics, we extensively evaluate 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.


The Majority is not always right: RL training for solution aggregation

Zhao, Wenting, Aggarwal, Pranjal, Saha, Swarnadeep, Celikyilmaz, Asli, Weston, Jason, Kulikov, Ilia

arXiv.org Artificial Intelligence

Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may only yield limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, allowing the model to learn both to recover minority-but-correct answers as well as easy majority-correct answers. Empirically, we find our method, AggLM, outperforms both strong rule-based and reward-model baselines, across multiple benchmarks. Furthermore, it generalizes effectively to solutions from differing models, including stronger ones than contained in the training data, all while requiring substantially fewer tokens than majority voting with larger numbers of solutions.
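The baseline the abstract argues against, and the reward a learned aggregator would optimize, can both be stated in a few lines. This is a sketch of the setup only; AggLM itself is an RL-trained model, not a rule.

```python
from collections import Counter

def majority_vote(answers):
    """The standard test-time baseline: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]

# Majority voting fails whenever the correct answer is in the minority:
candidates = ["12", "12", "12", "8", "8"]
assert majority_vote(candidates) == "12"   # even if "8" were correct

def aggregation_reward(aggregated_answer, gold):
    """Verifiable reward an AggLM-style aggregator is trained against: it may
    read all candidates and synthesize any answer, so it can recover a
    minority-but-correct solution that voting discards."""
    return 1.0 if aggregated_answer == gold else 0.0

assert aggregation_reward("8", "8") == 1.0
```

The contrast is the point: voting can only ever return a candidate's surface answer, while a trained aggregator is rewarded for the final answer itself, however it reconciles the candidates.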


CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

Yu, Ping, Lanchantin, Jack, Wang, Tianlu, Yuan, Weizhe, Golovneva, Olga, Kulikov, Ilia, Sukhbaatar, Sainbayar, Weston, Jason, Xu, Jing

arXiv.org Artificial Intelligence

We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on given seed tasks, and then generate a new synthetic example of similar quality and complexity. This is followed by a filtering step to select high-quality data using automatic metrics, which are then used for LLM training. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, when evaluated on MATH500, AMC23, AIME24, and GPQA-Diamond.

The transformative rise of Large Language Models (LLMs) has initiated a substantial paradigm shift in the domain of deep learning (Zhang et al., 2023; Guo et al., 2023; Long et al., 2024). The development of such models emphasizes scale, and relies heavily on large volumes of high-quality data (Gandhi et al., 2024; Abdin et al., 2024). However, acquiring such data from human sources can often be challenging or even impractical due to factors such as high costs, data scarcity, and privacy concerns (Kurakin et al., 2023). Furthermore, several studies (Hosking et al., 2023; Singh et al., 2023; Gilardi et al., 2023) have pointed out that human-generated data, being inherently prone to biases and errors, may not always be ideal for model training or evaluation. In this context, synthetic data emerges as a viable alternative for obtaining high-quality datasets. Synthetic data is artificially generated to replicate the characteristics and patterns of real-world data. One innovative approach to creating such data is the Self-Instruct method (Wang et al., 2022a), which utilizes LLMs themselves to generate instruction-following examples. This method begins by selecting a small set of seed instruction-following samples, which are then used to prompt LLMs to produce additional demonstrations in a similar format.
Since then, a number of variants have been introduced that increase the complexity of queries (Liu et al., 2023; Zeng et al., 2024), maintain semantic diversity (Ding et al., 2023), scale the synthetic data (Yuan et al., 2023), and use these methods in self-improvement loops (Yuan et al., 2024).
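The generate-then-filter pipeline described above has a simple shape, sketched below. The prompt wording, the stand-in generator, and the quality scorer are all placeholders; only the pipeline structure (CoT prompt per seed, then threshold filtering) follows the method description.

```python
def cot_self_instruct(seed_tasks, generate, score, n_samples=4, threshold=0.7):
    """Sketch of the CoT-Self-Instruct pipeline: for each seed task the LLM
    first reasons about the seed's style and difficulty, then emits a new
    synthetic task; a filtering step keeps only candidates whose automatic
    quality score clears a threshold."""
    kept = []
    for seed in seed_tasks:
        for _ in range(n_samples):
            candidate = generate(  # one LLM call: CoT plan, then a new task
                f"Reason step by step about the style and difficulty of this "
                f"task, then write one new task of similar quality:\n{seed}")
            if score(candidate) >= threshold:
                kept.append(candidate)
    return kept

# Toy stand-ins for the LLM and the automatic quality metric.
fake_generate = lambda prompt: prompt.splitlines()[-1].upper()
fake_score = lambda task: 1.0 if len(task) > 10 else 0.0
out = cot_self_instruct(["what is 2 + 2?"], fake_generate, fake_score, n_samples=1)
assert out == ["WHAT IS 2 + 2?"]
```

In practice `generate` would be an LLM sampling call and `score` one of the paper's automatic filtering metrics; both are abstracted here so the control flow stands alone.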


Learning to Reason for Factuality

Chen, Xilun, Kulikov, Ilia, Berges, Vincent-Pierre, Oğuz, Barlas, Shao, Rulin, Ghosh, Gargi, Weston, Jason, Yih, Wen-tau

arXiv.org Artificial Intelligence

Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and applies online RL to learn high quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.
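The reward-hacking failure mode and its fix can be made concrete with a toy multi-term reward. The additive form, weights, and cap below are assumptions for illustration; the paper's actual reward function differs in its exact formulation.

```python
def factuality_reward(n_supported, n_claims, relevance, w_detail=0.05, cap=20):
    """Hedged sketch of a multi-term factuality reward: precision alone is
    hackable by emitting short, vague responses, so a (capped) detail term
    and a relevance term are added to the objective."""
    if n_claims == 0:
        return 0.0
    precision = n_supported / n_claims
    detail = min(n_claims, cap) * w_detail   # reward more factual claims, capped
    return precision + detail + relevance

# Under precision alone, a two-claim vacuous answer (precision 1.0) would beat
# a detailed, mostly-correct one (precision 0.9). With all three terms it does not:
short_vague = factuality_reward(n_supported=2, n_claims=2, relevance=0.2)
detailed = factuality_reward(n_supported=18, n_claims=20, relevance=0.9)
assert detailed > short_vague
```

The cap on the detail term matters: without it, the reward could be hacked in the opposite direction, by padding responses with many marginal claims.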


Bridging Offline and Online Reinforcement Learning for LLMs

Lanchantin, Jack, Chen, Angelica, Lan, Janice, Li, Xian, Saha, Swarnadeep, Wang, Tianlu, Xu, Jing, Yu, Ping, Yuan, Weizhe, Weston, Jason E, Sukhbaatar, Sainbayar, Kulikov, Ilia

arXiv.org Artificial Intelligence

We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.


NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

Yuan, Weizhe, Yu, Jane, Jiang, Song, Padthe, Karthik, Li, Yang, Wang, Dong, Kulikov, Ilia, Cho, Kyunghyun, Tian, Yuandong, Weston, Jason E, Li, Xian

arXiv.org Artificial Intelligence

Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions. To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers. We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span multiple domains, including STEM fields (e.g., Physics, Computer Science), Economics, Social Sciences, and more. We demonstrate the utility of the questions in NaturalReasoning through knowledge distillation experiments which show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model. Furthermore, we demonstrate that NaturalReasoning is also effective for unsupervised self-training using external reward models or self-rewarding.


Post-training an LLM for RAG? Train on Self-Generated Demonstrations

Finlayson, Matthew, Kulikov, Ilia, Bikel, Daniel M., Oguz, Barlas, Chen, Xilun, Pappu, Aasish

arXiv.org Artificial Intelligence

Large language models (LLMs) often struggle with knowledge intensive NLP tasks, such as answering "Who won the latest World Cup?" because the knowledge they learn during training may be insufficient or outdated. Conditioning generation on retrieved documents -- a technique known as retrieval augmented generation (RAG) -- mitigates these shortcomings by allowing the model to leverage in-context information. Practitioners can improve LLM RAG performance by fine-tuning on retrieval-augmented instructions, but must beware that this can cause undesirable model behaviors like hallucinations. We attribute this degradation to the fact that the training data is likely to be out-of-distribution for the model and may suffer from quality issues, such as misalignment between retrievals and target responses (since retrievals are frequently added post-hoc). We propose a recipe for training RAG-enabled LLMs using self-generated demonstrations, thereby avoiding training on out-of-distribution text and integrating retrievals into the LLM responses. We evaluate our method on knowledge intensive question answering (QA) tasks and show that our method teaches LLMs to properly handle in-context retrievals and abstain from questions it will likely get wrong. Compared to conventional RA-IT methods, our method prevents model degradation in non-RAG settings while exhibiting superior QA performance.
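The recipe described above, keep the model's own retrieval-conditioned answers when verifiably correct, and teach abstention otherwise, can be sketched as a data-construction loop. All function names and the abstention string are placeholders; only the keep-or-abstain structure follows the method description.

```python
def build_self_demos(questions, retrieve, model_answer, is_correct):
    """Sketch of self-generated demonstration data for RAG fine-tuning:
    instead of pairing retrievals with pre-existing gold responses (often
    out-of-distribution for the model), keep the model's own
    retrieval-conditioned answer when it is verifiably correct, and an
    abstention target otherwise."""
    demos = []
    for q in questions:
        docs = retrieve(q)
        answer = model_answer(q, docs)
        target = answer if is_correct(q, answer) else "I don't know."
        demos.append({"question": q, "docs": docs, "target": target})
    return demos

# Toy stand-ins: the "model" copies its answer from the retrieved document.
retrieve = lambda q: ["France's capital is Paris."]
model_answer = lambda q, docs: "Paris" if "Paris" in docs[0] else "Rome"
is_correct = lambda q, a: a == "Paris"
demos = build_self_demos(["capital of France?"], retrieve, model_answer, is_correct)
assert demos[0]["target"] == "Paris"
```

Because every kept target was sampled from the model itself, fine-tuning on these demonstrations stays in-distribution, which is the mechanism the abstract credits for avoiding degradation in non-RAG settings.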