Exhaustive-Serve-Longest Control for Multi-robot Scheduling Systems

arXiv.org Artificial Intelligence

We study online task allocation for multi-robot, multi-queue systems with stochastic arrivals and switching delays. Time is slotted; each location can host at most one robot per slot; service consumes one slot; switching between locations incurs a one-slot travel delay; and arrivals are independent Bernoulli processes. We formulate a discounted-cost Markov decision process and propose Exhaustive-Serve-Longest (ESL), a simple real-time policy that serves exhaustively when the current location is nonempty and, when idle, switches to a longest unoccupied nonempty location, and we prove the optimality of this policy. As baselines, we tune a fixed-dwell cyclic policy via a discrete-time delay expression and implement a first-come-first-serve policy. Across server-to-location ratios and loads, ESL consistently yields lower discounted holding cost and smaller mean queue lengths, with action-time fractions showing more serving and restrained switching. Its simplicity and robustness make ESL a practical default for real-time multi-robot scheduling systems.
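The ESL rule described above can be sketched as a single decision function for one robot. This is a minimal illustration, not the authors' implementation: the function name `esl_action`, the state representation (a list of queue lengths plus a set of occupied locations), and the tie-breaking by lowest index are all assumptions.

```python
from typing import Sequence, Set, Tuple


def esl_action(location: int, queues: Sequence[int], occupied: Set[int]) -> Tuple[str, int]:
    """Pick one robot's action for the current slot under an ESL-style rule.

    location: index of the location the robot is currently at.
    queues:   current queue length at every location.
    occupied: locations currently hosting a robot (including this one).
    Returns ("serve", location), ("switch", target) or ("idle", location).
    """
    # Exhaustive service: keep serving while the current queue is nonempty.
    if queues[location] > 0:
        return ("serve", location)

    # Otherwise look for nonempty locations that no robot occupies.
    candidates = [
        j for j, q in enumerate(queues)
        if q > 0 and j not in occupied and j != location
    ]
    if not candidates:
        return ("idle", location)

    # Switch to a longest such queue (ties go to the lowest index: an assumption).
    target = max(candidates, key=lambda j: queues[j])
    return ("switch", target)
```

For example, a robot at an empty location 0 with queues `[0, 5, 7]` and location 2 already occupied would switch to location 1.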


Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have achieved significant advances in reasoning tasks. A key approach is tree-based search with verifiers, which expands candidate reasoning paths and uses reward models to guide pruning and selection. Although effective in improving accuracy, these methods are not optimal in terms of efficiency: they apply a simple decomposition to the reasoning process but ignore the planning-execution nature of tasks such as math reasoning or code generation. This results in inefficient exploration of the reasoning process. To address this, we propose a dual-phase test-time scaling framework that explicitly separates reasoning into planning and execution, and performs search over the two phases individually. Specifically, we decompose reasoning trajectories and develop reward models for each phase, enabling the search to explore and prune plans and executions separately. We further introduce a dynamic budget allocation mechanism that adaptively redistributes sampling effort based on reward feedback, allowing early stopping on confident steps and reallocation of computation to more challenging parts of the reasoning process. Experiments on both mathematical reasoning and code generation benchmarks demonstrate that our approach consistently improves accuracy while reducing redundant computation.


Boolean Satisfiability via Imitation Learning

arXiv.org Artificial Intelligence

We propose ImitSAT, a branching policy for conflict-driven clause learning (CDCL) solvers based on imitation learning for the Boolean satisfiability problem (SAT). Unlike previous methods that predict instance-level signals to improve CDCL branching indirectly, or rely on reinforcement learning and insufficient CDCL information to enhance branching, ImitSAT learns from an expert KeyTrace that collapses a full run into the sequence of surviving decisions. Replaying a KeyTrace on the same instance is nearly conflict-free, providing dense decision-level supervision and directly reducing propagations, the dominant contributor to wall-clock time. This prefix-conditioned supervision enables ImitSAT to reproduce high-quality branches without exploration, yielding faster convergence, stable training, and seamless integration into CDCL. Extensive experiments demonstrate that ImitSAT reduces propagation counts and runtime, outperforming state-of-the-art learned approaches.

The Boolean satisfiability (SAT) problem is a cornerstone of theoretical computer science and artificial intelligence (Cook, 1971; Karp, 1972). Beyond its foundational role, SAT serves as the computational backbone of numerous applications, including formal verification, planning, and combinatorial optimization. Modern solvers for SAT are dominated by the conflict-driven clause learning (CDCL) framework (Silva & Sakallah, 1996; Biere et al., 2009), which has scaled to industrial benchmarks of immense complexity. A CDCL run interleaves branching, unit propagation, and conflict analysis. Among these components, the branching rule largely determines the search trajectory, while unit propagation often dominates runtime (Zhang & Malik, 2002; Davis et al., 2008; Moskewicz et al., 2001). As a result, more informed branching decisions can translate directly into faster solving.


Learning to Condition: A Neural Heuristic for Scalable MPE Inference

arXiv.org Artificial Intelligence

We introduce learning to condition (L2C), a scalable, data-driven framework for accelerating Most Probable Explanation (MPE) inference in Probabilistic Graphical Models (PGMs), a fundamentally intractable problem. L2C trains a neural network to score variable-value assignments based on their utility for conditioning, given observed evidence. To facilitate supervised learning, we develop a scalable data generation pipeline that extracts training signals from the search traces of existing MPE solvers. The trained network serves as a heuristic that integrates with search algorithms, acting as a conditioning strategy prior to exact inference or as a branching and node selection policy within branch-and-bound solvers. We evaluate L2C on challenging MPE queries involving high-treewidth PGMs. Experiments show that our learned heuristic significantly reduces the search space while maintaining or improving solution quality over state-of-the-art methods.


Value-Guided Search for Efficient Chain-of-Thought Reasoning

arXiv.org Artificial Intelligence

In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance as majority voting. Our dataset, model and codebase are open-sourced.
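The final aggregation step mentioned above, a weighted majority vote, can be sketched as follows. This is only an illustration under assumptions: the abstract does not say how the weights are computed, so the function name `weighted_majority_vote` and the use of one scalar score per candidate answer (for instance, a value-model score) are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def weighted_majority_vote(answers: Sequence[Hashable], weights: Sequence[float]) -> Hashable:
    """Return the answer whose candidates carry the largest total weight.

    answers: final answer extracted from each sampled response.
    weights: one score per response (assumed here to come from a value model).
    """
    if len(answers) != len(weights):
        raise ValueError("answers and weights must have the same length")
    if not answers:
        raise ValueError("at least one answer is required")

    # Accumulate the weight assigned to each distinct answer.
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight

    # The winner is the answer with the highest accumulated weight.
    return max(totals, key=totals.get)
```

With answers `["42", "41", "42", "41", "41"]` and weights `[0.9, 0.2, 0.8, 0.3, 0.1]`, an unweighted majority vote would pick `"41"` (three votes), whereas the weighted vote picks `"42"` because its two candidates carry more total weight.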


Planning Jerk-Optimized Trajectory with Discrete-Time Constraints for Redundant Robots

arXiv.org Artificial Intelligence

We present a method for effectively planning the motion trajectory of robots in manufacturing tasks, the tool-paths of which are usually complex and have a large number of discrete time constraints as waypoints. Kinematic redundancy also exists in these robotic systems. Our trajectory planning method optimizes the jerk of motion, alongside the fabrication process, to improve the quality of fabrication. Our method is based on a sampling strategy and consists of two major parts. After determining an initial path by graph search, a greedy algorithm is adopted to optimize the path by locally applying adaptive filters in the regions with large jerks. The filtered result is obtained by numerical optimization. In order to achieve efficient computation, an adaptive sampling method is developed for learning a collision-indication function that is represented as a support-vector machine. Applications in robot-assisted 3D printing are given in this paper to demonstrate the functionality of our approach.

In robot-assisted manufacturing applications, robotic arms are employed to realize the motion of workpieces (or machining tools) specified as a sequence of waypoints with the positions of the tool tip and the tool orientations constrained. The number of degrees of freedom (DOF) required is often smaller than what the robotic hardware provides (e.g., a robotic arm has 6 DOF). Specifically, rotations of the workpiece around the axis of a tool can be arbitrary (see Figure 1 for an example). By exploiting this redundancy (i.e., many possible poses of a robotic arm realize a given waypoint), the trajectory of robots can be optimized to consider the performance of motion in velocity, acceleration and jerk in the joint space. In addition, when fabricating complex models each tool-path can have a large number of waypoints. It is crucial for a motion planning algorithm to compute a smooth and collision-free trajectory of the robot to improve fabrication quality. The time taken by the planning algorithm should not significantly lengthen the total manufacturing time; ideally it would remain hidden, as computing motions for a layer can be done while the previous layer is printing. The method presented in this paper provides an efficient framework to tackle this problem. The framework has been well tested on our robot-assisted additive manufacturing system to demonstrate its effectiveness and can be generally applied to other robot-assisted manufacturing systems.


Hybrid Quantum-Classical Optimisation of Traveling Salesperson Problem

arXiv.org Artificial Intelligence

The Traveling Salesperson Problem (TSP), a quintessential NP-hard combinatorial optimisation challenge, is vital for logistics and network design but limited by exponential complexity in large instances. We propose a hybrid quantum-classical framework integrating variational quantum eigensolver (VQE) optimisation with classical machine learning, using K-means clustering for problem decomposition and a RandomForestRegressor for path refinement. Evaluated on instances of 4 to 80 European cities (38,500 samples in total) via Qiskit's AerSimulator and the ibm_kyiv 127-qubit backend, the hybrid approach outperforms quantum-only methods, achieving an approximation ratio of 1.0287 at 80 cities, a 47.5% improvement over quantum-only's 1.9614, nearing the classical baseline. Machine learning reduces variability in tour distances (the interquartile range, IQR, i.e. the spread of the middle 50% of results relative to the median, falls from 0.06 to 0.04), enhancing stability despite the noise of noisy intermediate-scale quantum (NISQ) hardware. This framework underscores hybrid strategies' potential for scalable TSP optimisation, with future hardware advancements promising practical quantum advantages.
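The quoted improvement can be checked from the two approximation ratios, assuming it is measured as the relative reduction of the quantum-only ratio (the abstract does not state the formula):

```latex
\[
\frac{1.9614 - 1.0287}{1.9614} = \frac{0.9327}{1.9614} \approx 0.4755
\]
```

That is a reduction of roughly 47.5%, which matches the reported figure up to rounding.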


How to Hedge an Option Against an Adversary: Black-Scholes Pricing is Minimax Optimal

Neural Information Processing Systems

We consider a popular problem in finance, option pricing, through the lens of an online learning game between Nature and an Investor. In the Black-Scholes option pricing model from 1973, the Investor can continuously hedge the risk of an option by trading the underlying asset, assuming that the asset's price fluctuates according to Geometric Brownian Motion (GBM). We consider a worst-case model, in which Nature chooses a sequence of price fluctuations under a cumulative quadratic volatility constraint, and the Investor can make a sequence of hedging decisions. Our main result is to show that the value of our proposed game, which is the "regret" of the hedging strategy, converges to the Black-Scholes option price. We use significantly weaker assumptions than previous work (for instance, we allow large jumps in the asset price) and show that the Black-Scholes hedging strategy is near-optimal for the Investor even in this non-stochastic framework.
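The abstract does not state the Black-Scholes price itself. As a point of reference, the sketch below computes the standard closed-form price of a European call under GBM. Treating the option as a European call, and the parameter names `spot`, `strike`, `rate`, `sigma` and `maturity`, are assumptions made for illustration, not details taken from the paper.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot: float, strike: float, rate: float,
                       sigma: float, maturity: float) -> float:
    """Standard Black-Scholes price of a European call option.

    spot:     current price of the underlying asset.
    strike:   strike price of the option.
    rate:     continuously compounded risk-free rate.
    sigma:    volatility of the underlying (per unit of time).
    maturity: time to expiry, in the same time unit as rate and sigma.
    """
    if spot <= 0 or strike <= 0 or sigma <= 0 or maturity <= 0:
        raise ValueError("spot, strike, sigma and maturity must be positive")

    vol_sqrt_t = sigma * math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * maturity) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t

    # Discounted expected payoff under the risk-neutral measure.
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)
```

A call such as `black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)` prices an at-the-money option with one unit of time to expiry.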


Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

Neural Information Processing Systems

Monte-Carlo tree search is drawing great interest in the domain of planning under uncertainty, particularly when little or no domain knowledge is available. One of the central problems is the trade-off between exploration and exploitation. In this paper we present a novel Bayesian mixture modelling and inference based Thompson sampling approach to addressing this dilemma. The proposed Dirichlet-NormalGamma MCTS (DNG-MCTS) algorithm represents the uncertainty of the accumulated reward for actions in the MCTS search tree as a mixture of Normal distributions and performs Bayesian inference on it by choosing conjugate priors in the form of combinations of Dirichlet and NormalGamma distributions. Thompson sampling is used to select the best action at each decision node. Experimental results show that our proposed algorithm achieves state-of-the-art performance compared with the popular UCT algorithm in the context of online planning for general Markov decision processes.
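The Thompson sampling step with a NormalGamma conjugate prior can be sketched in a much simplified form. This is not DNG-MCTS: it omits the Dirichlet part and the mixture over successor states, and keeps one NormalGamma posterior per action in a bandit-style setting. The class name, the default hyperparameters, and the function `thompson_select` are all assumptions for illustration.

```python
import random
from dataclasses import dataclass
from typing import Dict, Hashable


@dataclass
class NormalGamma:
    """NormalGamma posterior over the mean and precision of a Normal reward."""
    mu: float = 0.0     # prior mean (assumed default)
    lam: float = 1.0    # pseudo-count for the mean (assumed default)
    alpha: float = 1.0  # shape of the Gamma over the precision (assumed default)
    beta: float = 1.0   # rate of the Gamma over the precision (assumed default)

    def update(self, x: float) -> None:
        """Standard conjugate update for a single observed reward x."""
        # beta must be updated first because it uses the old mu and lam.
        self.beta += self.lam * (x - self.mu) ** 2 / (2.0 * (self.lam + 1.0))
        self.mu = (self.lam * self.mu + x) / (self.lam + 1.0)
        self.lam += 1.0
        self.alpha += 0.5

    def sample_mean(self, rng: random.Random) -> float:
        """Draw a plausible mean reward from the posterior."""
        # gammavariate takes (shape, scale), so the rate beta is inverted.
        tau = rng.gammavariate(self.alpha, 1.0 / self.beta)
        return rng.gauss(self.mu, (1.0 / (self.lam * tau)) ** 0.5)


def thompson_select(posteriors: Dict[Hashable, NormalGamma], rng: random.Random) -> Hashable:
    """Sample one mean per action and return the action with the largest sample."""
    return max(posteriors, key=lambda action: posteriors[action].sample_mean(rng))
```

Early on, wide posteriors make every action likely to be sampled as best, which drives exploration; as observations accumulate, the samples concentrate around the empirical means and the choice becomes mostly greedy.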


Embed and Project: Discrete Sampling with Universal Hashing

Neural Information Processing Systems

We consider the problem of sampling from a probability distribution defined over a high-dimensional discrete set, specified for instance by a graphical model. We propose a sampling algorithm, called PAWS, based on embedding the set into a higher-dimensional space which is then randomly projected using universal hash functions to a lower-dimensional subspace and explored using combinatorial search methods. Our scheme can leverage fast combinatorial optimization tools as a blackbox and, unlike MCMC methods, samples produced are guaranteed to be within an (arbitrarily small) constant factor of the true probability distribution. We demonstrate that by using state-of-the-art combinatorial search tools, PAWS can efficiently sample from Ising grids with strong interactions and from software verification instances, while MCMC and variational methods fail in both cases.