Goto

Collaborating Authors

 Technology


Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Neural Information Processing Systems

We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency--requiring only 1/10 of the training steps compared to the DeepSeek-R1-Zero pipeline.


Combining Cost Constrained Runtime Monitors for AI Safety

Neural Information Processing Systems

Monitoring AIs at runtime can help us detect and stop harmful actions. In this paper, we study how to efficiently combine multiple runtime monitors into a single monitoring protocol. The protocol's objective is to maximize the probability of applying a safety intervention on misaligned outputs (i.e., maximize recall). Since running monitors and applying safety interventions are costly, the protocol also needs to adhere to an average-case budget constraint. Taking the monitors' performance and cost as given, we develop an algorithm to find the best protocol. The algorithm exhaustively searches over when and which monitors to call, and allocates safety interventions based on the Neyman-Pearson lemma. By focusing on likelihood ratios and strategically trading off spending on monitors against spending on interventions, we more than double our recall rate compared to a naive baseline in a code review setting. We also show that combining two monitors can Pareto dominate using either monitor alone. Our framework provides a principled methodology for combining existing monitors to detect undesirable behavior in cost-sensitive settings.


Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

Neural Information Processing Systems

Test-time scaling enhances large language model performance by allocating additional compute resources during decoding. Best-of-$N$ (BoN) sampling serves as a common sampling-based scaling technique, broadening the search space in parallel to find better solutions from the model distribution. However, its cost-performance trade-off is still underexplored. Two main challenges limit the efficiency of BoN sampling: (1) Generating $N$ full samples consumes substantial GPU memory, reducing inference capacity under limited resources.


TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation

Neural Information Processing Systems

Recently, convolutional filters have been increasingly adopted in sequential recommendation for their ability to capture local sequential patterns. However, most of these models complement convolutional filters with self-attention. This is because convolutional filters alone, generally fixed filters, struggle to capture global interactions necessary for accurate recommendation. We propose \textbf{T}ime-\textbf{V}ariant Convolutional Filters for Sequential \textbf{Rec}ommendation (TV-Rec), a model inspired by graph signal processing, where time-variant graph filters capture position-dependent temporal variations in user sequences. By replacing both fixed kernels and self-attention with time-variant filters, TV-Rec achieves higher expressive power and better captures complex interaction patterns in user behavior. This design not only eliminates the need for self-attention but also reduces computation while accelerating inference. Extensive experiments on six public benchmarks show that TV-Rec outperforms state-of-the-art baselines by an average of 7.49\%.


MoleBridge: Synthetic Space Projecting with Discrete Markov Bridges

Neural Information Processing Systems

Molecular synthetic space projecting is a critical technique in de novo molecular design, which aims to rectify molecules without synthesizability guarantee by converting them into synthetic postfix notations. However, the vast synthesizable chemical space and the discrete data modalities involved pose significant challenges to postfix notation conversion benchmarking. In this paper, we exploit conditional probability transitions in discrete state space and introduce MoleBridge, a deep generative model built on the Markov bridge approach for designing postfix notations of molecular synthesis pathways. MoleBridge consists of two iterative optimizations: i) Autoregressive extending of notation tokens from molecular graphs, and ii) generation of discrete reaction postfix notations through Markov bridge, where noisy token blocks are progressively denoised over multi-step iterations. For the challenging second iteration, which demands sensitivity to incorrect generative probability paths within intricate chemical spaces, we employ a thinking and denoising separation approach to denoise. Empirically, we find that MoleBridge is capable of accurately predicting synthesis pathways while exhibiting excellent performance in a variety of application scenarios.


Progressive Data Dropout: An Embarrassingly Simple Approach to Train Faster

Neural Information Processing Systems

The success of the machine learning field has reliably depended on training on large datasets. While effective, this trend comes at an extraordinary cost. This is due to two deeply intertwined factors: the size of models and the size of datasets. While promising research efforts focus on reducing the size of models, the other half of the equation remains fairly mysterious. Indeed, it is surprising that the standard approach to training remains to iterate over and over, uniformly sampling the training dataset.


MOF-BFN: Metal-Organic Frameworks Structure Prediction via Bayesian Flow Networks

Neural Information Processing Systems

Metal-Organic Frameworks (MOFs) have attracted considerable attention due to their unique properties including high surface area and tunable porosity, and promising applications in catalysis, gas storage, and drug delivery. Structure prediction for MOFs is a challenging task, as these frameworks are intrinsically periodic and hierarchically organized, where the entire structure is assembled from building blocks like metal nodes and organic linkers. To address this, we introduce MOF-BFN, a novel generative model for MOF structure prediction based on Bayesian Flow Networks (BFNs). Given the local geometry of building blocks, MOF-BFN jointly predicts the lattice parameters, as well as the positions and orientations of all building blocks within the unit cell. In particular, the positions are modelled in the fractional coordinate system to naturally incorporate the periodicity. Meanwhile, the orientations are modeled as unit quaternions sampled from learned Bingham distributions via the proposed Bingham BFN, enabling effective orientation generation on the 4D unit hypersphere. Experimental results demonstrate that MOF-BFN achieves state-of-the-art performance across multiple tasks, including structure prediction, geometric property evaluation, and de novo generation, offering a promising tool for designing complex MOF materials.


Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals

Neural Information Processing Systems

Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to maintain consistent reasoning when exposed to misleading or conflicting evidence, especially in real-world domains such as politics, where information is polarized or selectively framed. Mainstream RAG benchmarks evaluate models under clean retrieval settings, where systems generate answers from gold-standard documents, or under synthetically perturbed settings, where documents are artificially injected with noise. These assumptions fail to reflect real-world conditions, often leading to an overestimation of RAG system performance. To address this gap, we introduce \textsc{RAGuard}, the first benchmark to evaluate the robustness of RAG systems against \textit{misleading} retrievals. Unlike prior benchmarks that rely on synthetic noise, our fact-checking dataset captures naturally occurring misinformation by constructing its retrieval corpus from Reddit discussions. It categorizes retrieved evidence into three types: \textit{supporting}, \textit{misleading}, and \textit{unrelated}, providing a realistic and challenging testbed for assessing how well RAG systems navigate different types of evidence. Our experiments reveal that, when exposed to potentially misleading retrievals, all tested LLM-powered RAG systems perform worse than their zero-shot baselines (i.e., no retrieval at all), while human annotators consistently perform better, highlighting LLMs' susceptibility to noisy environments. To our knowledge, \textsc{RAGuard} is the first benchmark to systematically assess the robustness of the RAG against misleading evidence.We expect this benchmark to drive future research toward improving RAG systems beyond idealized datasets, making them more reliable for real-world applications.


STAR-Bets: Sequential TArget-Recalculating Bets for Tighter Confidence Intervals

Neural Information Processing Systems

The construction of confidence intervals for the mean of a bounded random variable is a classical problem in statistics with numerous applications in machine learning and virtually all scientific fields. In particular, obtaining the tightest possible confidence intervals is vital every time the sampling of the random variables is expensive. The current state-of-the-art method to construct confidence intervals is by using betting algorithms. This is a very successful approach for deriving optimal confidence sequences, even matching the rate of law of iterated logarithms. However, in the fixed horizon setting, these approaches are either sub-optimal or based on heuristic solutions with strong empirical performance but without a finite-time guarantee. Hence, no betting-based algorithm guaranteeing the optimal $\mathcal{O}(\sqrt{\frac{\sigma^2\log\frac1\delta}{n}})$ width of the confidence intervals are known.


KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning

Neural Information Processing Systems

Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models (LLMs), even without supervised fine-tuning (SFT). However, prevalent reinforcement learning algorithms such as GRPO and its variants like DAPO, suffer from a coarse granularity issue when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions. To address this limitation, we propose Key-token Advantage Estimation (KTAE)--a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimation. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.