 partial state


ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding

arXiv.org Artificial Intelligence

For improving the reasoning accuracy of LLMs, the representative reinforcement learning (RL) method GRPO fails because of insignificant reward variance, while verification methods based on process reward models (PRMs) suffer from difficulties with training data acquisition and verification effectiveness. To tackle these problems, this paper introduces ReST-RL, a unified LLM RL paradigm that significantly improves an LLM's code reasoning ability by combining an improved GRPO algorithm with a carefully designed test-time decoding method assisted by a value model (VM). As the first stage of policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter and assemble high-value training data, increasing the reward variance of GRPO sampling and thus improving the effectiveness and efficiency of training. After the basic reasoning ability of the LLM policy has been improved, we further propose a test-time decoding optimization method called VM-MCTS. Through Monte-Carlo Tree Search (MCTS), we collect accurate value targets with no annotation required, on which VM training is based. When decoding, the VM is deployed by an adapted MCTS algorithm to provide precise process signals as well as verification scores, assisting the LLM policy in achieving high reasoning accuracy. We conduct extensive experiments on coding problems to verify the validity of the proposed RL paradigm. In comparison, our approach significantly outperforms other reinforcement training baselines (e.g., naive GRPO and ReST-DPO), as well as decoding and verification baselines (e.g., PRM-BoN and ORM-MCTS), on well-known coding benchmarks of various levels (e.g., APPS, BigCodeBench, and HumanEval), indicating its power to strengthen the reasoning ability of LLM policies. Code for our project can be found at https://github.com/THUDM/ReST-RL.
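
The reward-variance issue mentioned in this abstract can be illustrated with a small sketch. It assumes the commonly used group-relative advantage, in which each sampled response's reward is standardized against the mean and standard deviation of its group; the function name, the `eps` constant, and the example rewards are illustrative assumptions, not taken from the ReST-RL code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style sketch).

    `eps` is an assumed small constant that avoids division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# Every sample earns the same reward: zero variance, so no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: successes and failures receive opposite-signed advantages.
mixed = group_relative_advantages([0.0, 1.0, 0.0, 1.0])

print(uniform)
print(mixed)
```

When all rewards in a group are identical, every advantage is zero and the group contributes no gradient, which is one way to read the motivation for assembling training data with higher reward variance.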


SOI: Scaling Down Computational Complexity by Estimating Partial States of the Model

Neural Information Processing Systems

Consumer electronics used to follow the miniaturization trend described by Moore's Law. Despite increased processing power in Microcontroller Units (MCUs), MCUs used in the smallest appliances are still not capable of running even moderately big, state-of-the-art artificial neural networks (ANNs), especially in time-sensitive scenarios. In this work, we present a novel method called Scattered Online Inference (SOI) that aims to reduce the computational complexity of ANNs. By applying compression, SOI generates more general inner partial states of the ANN, which makes it possible to skip full model recalculation at each inference.


Variational Combinatorial Sequential Monte Carlo for Bayesian Phylogenetics in Hyperbolic Space

arXiv.org Machine Learning

Hyperbolic space naturally encodes hierarchical structures such as phylogenies (binary trees), where inward-bending geodesics reflect paths through least common ancestors, and the exponential growth of neighborhoods mirrors the super-exponential scaling of topologies. This scaling challenge limits the efficiency of Euclidean-based approximate inference methods. Motivated by the geometric connections between trees and hyperbolic space, we develop novel hyperbolic extensions of two sequential search algorithms: Combinatorial and Nested Combinatorial Sequential Monte Carlo (CSMC and NCSMC). Our approach introduces consistent and unbiased estimators, along with variational inference methods (H-VCSMC and H-VNCSMC), which outperform their Euclidean counterparts. Empirical results demonstrate improved speed, scalability and performance in high-dimensional phylogenetic inference tasks.


SOI: Scaling Down Computational Complexity by Estimating Partial States of the Model

arXiv.org Artificial Intelligence

Consumer electronics used to follow the miniaturization trend described by Moore's Law. Despite increased processing power in Microcontroller Units (MCUs), MCUs used in the smallest appliances are still not capable of running even moderately big, state-of-the-art artificial neural networks (ANNs), especially in time-sensitive scenarios. In this work, we present a novel method called Scattered Online Inference (SOI) that aims to reduce the computational complexity of ANNs. SOI leverages the continuity and seasonality of time-series data and model predictions, enabling extrapolation for processing speed improvements, particularly in deeper layers. By applying compression, SOI generates more general inner partial states of the ANN, which makes it possible to skip full model recalculation at each inference.


Stochastic Multi-round Submodular Optimization with Budget

arXiv.org Artificial Intelligence

In this work we study the problem of Stochastic Budgeted Multi-round Submodular Maximization (SBMSm), in which we would like to adaptively maximize the sum, over multiple rounds, of the value of a monotone and submodular objective function defined on a subset of items. The values of this function depend on the realization of stochastic events, and the number of items that we can select over all rounds is limited by a given budget. This problem extends, and generalizes to multi-round settings, well-studied problems such as (adaptive) influence maximization and stochastic probing. We first show that, if the number of items and stochastic events is suitably bounded, there is a polynomial-time dynamic programming algorithm for SBMSm. Then, we provide a simple greedy approximation algorithm for SBMSm that first non-adaptively allocates the budget to be spent at each round, and then greedily and adaptively maximizes the objective function by using the budget assigned at each round. This algorithm guarantees a $(1-1/e-\epsilon)$-approximation to the optimal adaptive value. Finally, by introducing a metric called the budget-adaptivity gap, we measure how much better an optimal policy for SBMSm, which is adaptive in both the budget allocation and item selection, is than an optimal partially adaptive policy that, as in our greedy algorithm, determines the budget allocation in advance. We show a tight bound of $e/(e-1)$ on the budget-adaptivity gap, and this result implies that our greedy algorithm guarantees the best approximation among all partially adaptive policies.
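
The greedy step in this abstract builds on the classical greedy rule for monotone submodular maximization under a budget. The sketch below is not the SBMSm algorithm: it is a simplified single-round, deterministic instance (maximum coverage with a cardinality budget), and the names `greedy_max_coverage`, `sets`, and `budget` are hypothetical.

```python
def greedy_max_coverage(sets, budget):
    """Pick up to `budget` named sets, each time taking the largest marginal gain.

    Coverage (the size of the union of the chosen sets) is monotone and submodular.
    """
    chosen, covered = [], set()
    for _ in range(budget):
        candidates = [name for name in sets if name not in chosen]
        if not candidates:
            break
        # Marginal gain of a candidate = number of newly covered elements.
        best = max(candidates, key=lambda name: len(sets[name] - covered))
        if not sets[best] - covered:
            break  # no remaining candidate adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
chosen, covered = greedy_max_coverage(sets, budget=2)
print(chosen, sorted(covered))
```

In this deterministic setting the greedy rule is known to achieve a $(1-1/e)$-approximation; the abstract's $(1-1/e-\epsilon)$ guarantee concerns the harder stochastic, multi-round problem.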


Variational Pseudo Marginal Methods for Jet Reconstruction in Particle Physics

arXiv.org Artificial Intelligence

Reconstructing jets, which provide vital insights into the properties and histories of subatomic particles produced in high-energy collisions, is a central problem in data analysis in collider physics. This intricate task deals with estimating the latent structure of a jet (binary tree) and involves parameters such as particle energy, momentum, and types. While Bayesian methods offer a natural approach for handling uncertainty and leveraging prior knowledge, they face significant challenges due to the super-exponential growth of potential jet topologies as the number of observed particles increases. To address this, we introduce a Combinatorial Sequential Monte Carlo approach for inferring jet latent structures. As a second contribution, we leverage the resulting estimator to develop a variational inference algorithm for parameter learning. Building on this, we introduce a variational family using a pseudo-marginal framework for a fully Bayesian treatment of all variables, unifying the generative model with the inference process. We illustrate our method's effectiveness through experiments using data generated with a collider physics generative model, highlighting superior speed and accuracy across a range of tasks.


Policy-Space Search: Equivalences, Improvements, and Compression

arXiv.org Artificial Intelligence

Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty. It models uncertainty through actions with non-deterministic effects. A* with Non-Determinism (AND*) (Messa and Pereira, 2023) is a FOND planner that generalizes A* (Hart et al., 1968) for FOND planning. It searches for a solution policy by performing an explicit heuristic search on the policy space of the FOND task. In this paper, we study and improve the performance of the policy-space search performed by AND*. We present a polynomial-time procedure that constructs a solution policy given just the set of states that should be mapped. This procedure, together with a better understanding of the structure of FOND policies, allows us to present three concepts of equivalences between policies. We use policy equivalences to prune part of the policy search space, making AND* substantially more effective in solving FOND tasks. We also study the impact of taking into account structural state-space symmetries to strengthen the detection of equivalent policies and the impact of performing the search with satisficing techniques. We apply a recent technique from the group theory literature to better compute structural state-space symmetries. Finally, we present a solution compressor that, given a policy defined over complete states, finds a policy that unambiguously represents it using the minimum number of partial states. AND* with the introduced techniques generates, on average, two orders of magnitude fewer policies to solve FOND tasks. These techniques allow explicit policy-space search to be competitive in terms of both coverage and solution compactness with other state-of-the-art FOND planners.


Understanding Sample Generation Strategies for Learning Heuristic Functions in Classical Planning

arXiv.org Artificial Intelligence

We study the problem of learning good heuristic functions for classical planning tasks with neural networks based on samples represented by states with their cost-to-goal estimates. The heuristic function is learned for a state space and goal condition with the number of samples limited to a fraction of the size of the state space, and must generalize well for all states of the state space with the same goal condition. Our main goal is to better understand the influence of sample generation strategies on the performance of a greedy best-first heuristic search (GBFS) guided by a learned heuristic function. In a set of controlled experiments, we find that two main factors determine the quality of the learned heuristic: which states are included in the sample set and the quality of the cost-to-goal estimates. These two factors are dependent: having perfect cost-to-goal estimates is insufficient if the samples are not well distributed across the state space. We also study other effects, such as adding samples with high-value estimates. Based on our findings, we propose practical strategies to improve the quality of learned heuristics: three strategies that aim to generate more representative states and two strategies that improve the cost-to-goal estimates. Our practical strategies almost double the mean coverage of a GBFS algorithm guided by a learned heuristic.
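
For readers unfamiliar with GBFS, the following minimal sketch shows how a heuristic function steers the search. It makes several assumptions: the learned neural heuristic is replaced by a hand-written function `h`, and the toy state space (integers with successor moves `n + 1` and `2 * n`, capped at 20) is purely illustrative and unrelated to the paper's planning benchmarks.

```python
import heapq


def gbfs(start, goal, neighbors, h):
    """Greedy best-first search: always expand the state with the lowest h-value."""
    frontier = [(h(start), start)]
    parent = {start: None}  # doubles as the set of already generated states
    while frontier:
        _, state = heapq.heappop(frontier)
        if state == goal:
            path = []
            while state is not None:  # walk parent links back to the start
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in parent:
                parent[nxt] = state
                heapq.heappush(frontier, (h(nxt), nxt))
    return None  # goal unreachable


def neighbors(n):
    # Toy successor function (assumption): increment or double, capped at 20.
    return [m for m in (n + 1, 2 * n) if m <= 20]


def h(n):
    # Stand-in for a learned cost-to-goal estimate (assumption).
    return abs(10 - n)


path = gbfs(1, 10, neighbors, h)
print(path)
```

Because GBFS orders expansions by the heuristic value alone, the quality of the cost-to-goal estimates directly determines which states are expanded, which is why the sampling strategies studied in the abstract matter.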


Semantically-enhanced Deep Collision Prediction for Autonomous Navigation using Aerial Robots

arXiv.org Artificial Intelligence

This paper contributes a novel and modularized learning-based method for aerial robots navigating cluttered environments containing hard-to-perceive thin obstacles without assuming access to a map or the full pose estimation of the robot. The proposed solution builds upon a semantically-enhanced Variational Autoencoder that is trained with both real-world and simulated depth images to compress the input data, while preserving semantically-labeled thin obstacles and handling invalid pixels in the depth sensor's output. This compressed representation, together with the robot's partial state involving its linear/angular velocities and its attitude, is then used to train an uncertainty-aware 3D Collision Prediction Network in simulation to predict collision scores for candidate action sequences in a predefined motion primitives library. A set of simulation and experimental studies in cluttered environments with various sizes and types of obstacles, including multiple hard-to-perceive thin objects, was conducted to evaluate the performance of the proposed method and compare it against an end-to-end trained baseline. The results demonstrate the benefits of the proposed semantically-enhanced deep collision prediction for learning-based autonomous navigation.


Kinodynamic FMT* with Dimensionality Reduction Heuristics and Neural Network Controllers

arXiv.org Artificial Intelligence

This paper proposes a new sampling-based kinodynamic motion planning algorithm, called FMT*PFF, for nonlinear systems. It exploits the novel idea of dimensionality reduction using partial-final-state-free (PFF) optimal controllers. With the proposed dimensionality reduction heuristic, the search space is restricted to a subspace, and thus faster convergence is achieved compared to a regular kinodynamic FMT*. The dimensionality reduction heuristic can be viewed as a sampling strategy, and asymptotic optimality is preserved when combined with uniform full-state sampling. Another feature of FMT*PFF is the ability to deal with a steering function with inexact steering, which is vital when using learning-based steering functions. Learning-based methods allow us to solve the steering problem for nonlinear systems efficiently. However, learning-based methods often fail to reach the exact goal state. For nonlinear systems, we train a neural network controller using supervised learning to generate the steering commands. We show that FMT*PFF with a learning-based steering function is efficient and generates dynamically feasible motion plans. We compare our algorithm with previous algorithms and show superior performance in various simulations.