DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Neural Information Processing Systems

The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias arising from its group relative advantage function. We also identify a connection between GRPO and traditional discriminative methods in supervised learning. Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning: increasing the scores of positive answers while decreasing those of negative ones. The main differences between DisCO and GRPO and its recent variants are: (1) it replaces the group relative objective with a discriminative objective defined by a scoring function; (2) it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions; (3) it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint. As a result, DisCO offers notable advantages over GRPO and its variants: (i) it completely eliminates difficulty bias by adopting discriminative objectives; (ii) it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach, yielding long and stable training dynamics; (iii) it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where a significant number of questions have more negative than positive generated answers during training.
Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for a 1.5B model.
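
To make the "group relative advantage" in the abstract above concrete, here is a minimal sketch of the usual GRPO-style standardization of rewards within one question's group of sampled answers, applied to binary rewards. It is an illustration under stated assumptions, not the DisCO authors' code: the function name, the `eps` stabilizer, and the example groups are hypothetical. It shows only that the size of each answer's advantage depends on the question's pass rate, and that a group whose answers are all correct (or all wrong) contributes zero advantage.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one question's group of sampled answers.

    Hypothetical helper: (reward - group mean) / (group std + eps).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n  # population variance
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical groups of 8 sampled answers with binary (0/1) rewards.
easy = [1, 1, 1, 1, 1, 1, 1, 0]    # pass rate 7/8
medium = [1, 1, 1, 1, 0, 0, 0, 0]  # pass rate 1/2
solved = [1, 1, 1, 1, 1, 1, 1, 1]  # pass rate 1: every advantage is zero

for name, group in [("easy", easy), ("medium", medium), ("solved", solved)]:
    adv = group_relative_advantages(group)
    print(name, [round(a, 3) for a in adv])
```

With binary rewards and pass rate p, the group mean is p and the group standard deviation is sqrt(p(1 - p)), so the scale of the advantages varies from question to question.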


State-Covering Trajectory Stitching for Diffusion Planners

Neural Information Processing Systems

Diffusion-based generative models are emerging as powerful tools for long-horizon planning in reinforcement learning (RL), particularly with offline datasets. However, their performance is fundamentally limited by the quality and diversity of training data. This often restricts their generalization to tasks outside their training distribution or longer planning horizons. To overcome this challenge, we propose State-Covering Trajectory Stitching (SCoTS), a novel reward-free trajectory augmentation method that incrementally stitches together short trajectory segments, systematically generating diverse and extended trajectories. SCoTS first learns a temporal distance-preserving latent representation that captures the underlying temporal structure of the environment, then iteratively stitches trajectory segments guided by directional exploration and novelty to effectively cover and expand this latent space. We demonstrate that SCoTS significantly improves the performance and generalization capabilities of diffusion planners on offline goal-conditioned benchmarks requiring stitching and long-horizon reasoning. Furthermore, augmented trajectories generated by SCoTS significantly improve the performance of widely used offline goal-conditioned RL algorithms across diverse environments. Our code is available at https://github.com/leekwoon/scots/


On Evaluating Policies for Robust POMDPs

Neural Information Processing Systems

Robust partially observable Markov decision processes (RPOMDPs) model sequential decision-making problems under partial observability, where an agent must be robust against a range of dynamics. RPOMDPs can be viewed as a two-player game between an agent, who selects actions, and nature, who adversarially selects the dynamics. Evaluating an agent policy requires finding an adversarial nature policy, which is computationally challenging. In this paper, we advance the evaluation of agent policies for RPOMDPs in three ways. First, we discuss suitable benchmarks.


Appendix: Table of Contents

Neural Information Processing Systems

Starting from Grobid's XML output, peS2o filters out papers that are too short, have incorrect metadata, are in languages other than English, or contain OCR errors, using a combination of heuristic- and model-based filtering steps. We refer the reader to the datasheet and code for more details on this processing pipeline. The subset of peS2o included in the Common Pile starts from v3 of the corpus, which contains documents from January 1, 1970 to October 6, 2024. We retain full-text papers with CC BY, CC BY-SA, or CC0 licenses, or that have been labeled as public domain; metadata is provided by the Semantic Scholar APIs [85]. After filtering, this set contains 6.3 million papers, or 35.7 billion whitespace-separated segments.
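
The license filter and the size measure described above can be sketched as follows. This is an illustrative sketch only, not the actual peS2o or Common Pile pipeline: the record layout (dicts with `license` and `text` keys), the license label strings, and both function names are assumptions made for the example.

```python
# Hypothetical license labels; the real metadata may spell them differently.
ALLOWED_LICENSES = {"CC BY", "CC BY-SA", "CC0", "public domain"}

def keep_openly_licensed(papers):
    """Keep only records whose license label is in the allowed set."""
    return [p for p in papers if p["license"] in ALLOWED_LICENSES]

def count_whitespace_segments(papers):
    """Count whitespace-separated segments across all retained texts."""
    return sum(len(p["text"].split()) for p in papers)

# Toy records standing in for parsed papers.
papers = [
    {"license": "CC BY", "text": "Openly licensed full text."},
    {"license": "CC BY-NC", "text": "Non-commercial text is dropped."},
    {"license": "CC0", "text": "Dedicated to the   public domain"},
]

kept = keep_openly_licensed(papers)
print(len(kept), count_whitespace_segments(kept))
```

`str.split()` with no arguments splits on runs of whitespace, so repeated spaces do not inflate the count.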


The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Neural Information Processing Systems

Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.


Solving Asymmetric Traveling Salesman Problem via Trace-Guided Cost Augmentation

Neural Information Processing Systems

The Asymmetric Traveling Salesman Problem (ATSP) is one of the most fundamental and notoriously challenging problems in combinatorial optimization. We propose a novel continuous relaxation framework for ATSP that leverages differentiable constraints to encourage acyclic structures and valid permutations.


Adaptive Algorithms with Sharp Convergence Rates for Stochastic Hierarchical Optimization

Neural Information Processing Systems

Hierarchical optimization refers to problems with interdependent decision variables and objectives, such as minimax and bilevel formulations. While various algorithms have been proposed, existing methods and analyses lack adaptivity in stochastic optimization settings: they cannot achieve optimal convergence rates across a wide spectrum of gradient noise levels without prior knowledge of the noise magnitude. In this paper, we propose novel adaptive algorithms for two important classes of stochastic hierarchical optimization problems: nonconvex-strongly-concave minimax optimization and nonconvex-strongly-convex bilevel optimization. Our algorithms achieve sharp convergence rates of $\widetilde{O}(1/\sqrt{T} + \sqrt{\sigma}/T^{1/4})$ in $T$ iterations for the gradient norm, where $\sigma$ is an upper bound on the stochastic gradient noise. Notably, these rates are obtained without prior knowledge of the noise level, thereby enabling automatic adaptivity in both low and high-noise regimes. To our knowledge, this work provides the first adaptive and sharp convergence guarantees for stochastic hierarchical optimization. Our algorithm design combines the momentum normalization technique with novel adaptive parameter choices. Extensive experiments on synthetic and deep learning tasks demonstrate the effectiveness of our proposed algorithms.
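
The abstract above names momentum normalization as one ingredient of the algorithm design. As a rough illustration of that ingredient alone, the sketch below runs generic normalized SGD with momentum on a toy single-level quadratic. It is not the paper's method: the fixed step size `lr` and momentum `beta` stand in for the adaptive parameter choices, which are not described here, and the function names, the toy objective, and the optional Gaussian gradient noise are all assumptions.

```python
import math
import random

def normalized_momentum_sgd(grad_fn, x0, steps=500, lr=0.05, beta=0.9,
                            noise_std=0.0, seed=0):
    """Generic normalized SGD with momentum (fixed, non-adaptive parameters)."""
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)  # momentum buffer
    for _ in range(steps):
        # Stochastic gradient: true gradient plus optional Gaussian noise.
        g = [gi + rng.gauss(0.0, noise_std) for gi in grad_fn(x)]
        # Exponential moving average of gradients.
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        norm = math.sqrt(sum(mi * mi for mi in m))
        if norm > 0.0:
            # Step along the normalized momentum direction, so every
            # update has length exactly lr regardless of gradient scale.
            x = [xi - lr * mi / norm for xi, mi in zip(x, m)]
    return x

def quadratic_grad(x):
    """Gradient of the toy objective f(x) = 0.5 * ||x||^2."""
    return list(x)

x_final = normalized_momentum_sgd(quadratic_grad, [3.0, -2.0])
print(math.hypot(*x_final))
```

Because each step has fixed length, a constant `lr` leaves the iterate hovering near the minimizer rather than converging exactly, which is one reason step-size choices matter for rates like the one quoted above.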


528d56195a2c77c808494c86fa7c77ad-Supplemental-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

A.1 Dataset Examples. In this section of the appendix, we present a detailed overview of several representative tasks from each category included in REASONINGGYM. For each task, we describe its structure and complexity parameters, and provide examples. A.1.1 complex_arithmetic (Algebra): Find the solution of an arithmetic operation involving complex numbers. The spiral order is clockwise, starting from the top-left corner. Predict the corresponding output grid by applying the rule you found.


REASONINGGYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Neural Information Processing Systems

... complexity, unlike most previous reasoning datasets, which are typically ... This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both evaluating and reinforcement learning of reasoning models. [Figure: Figlet fonts task example. Question: What word does this say?]


PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine Understanding and Generation

Neural Information Processing Systems

This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. It establishes a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.