Search
06d5ae105ea1bea4d800bc96491876e9-AuthorFeedback.pdf
We thank all the reviewers for the constructive comments. We address the major concerns below. Reproducibility: 1) learning to draft details; 2) feature details; 3) discussions on the computing resources used. The search tree is updated based on four steps of MCTS. The learning rate is set to 0.001 with Adam.
Parallel Heuristic Search as Inference for Actor-Critic Reinforcement Learning Models
Yang, Hanlan, Mishani, Itamar, Pivetti, Luca, Kingston, Zachary, Likhachev, Maxim
Actor-Critic models are a class of model-free deep reinforcement learning (RL) algorithms that have demonstrated effectiveness across various robot learning tasks. While considerable research has focused on improving training stability and data sampling efficiency, most deployment strategies have remained relatively simplistic, typically relying on direct actor policy rollouts. In contrast, we propose \pachs{} (\textit{P}arallel \textit{A}ctor-\textit{C}ritic \textit{H}euristic \textit{S}earch), an efficient parallel best-first search algorithm for inference that leverages both components of the actor-critic architecture: the actor network generates actions, while the critic network provides cost-to-go estimates to guide the search. Two levels of parallelism are employed within the search -- actions and cost-to-go estimates are generated in batches by the actor and critic networks respectively, and graph expansion is distributed across multiple threads. We demonstrate the effectiveness of our approach in robotic manipulation tasks, including collision-free motion planning and contact-rich interactions such as non-prehensile pushing. Visit p-achs.github.io for demonstrations and examples.
Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search
Ryu, Sangwon, Do, Heejin, Kim, Yunsu, Lee, Gary Geunbae, Ok, Jungseul
Controllable summarization moves beyond generic outputs toward human-aligned summaries guided by specified attributes. In practice, the interdependence among attributes makes it challenging for language models to satisfy correlated constraints consistently. Moreover, previous approaches often require per-attribute fine-tuning, limiting flexibility across diverse summary attributes. In this paper, we propose adaptive planning for multi-attribute controllable summarization (PACO), a training-free framework that reframes the task as planning the order of sequential attribute control with a customized Monte Carlo Tree Search (MCTS). In PACO, nodes represent summaries, and actions correspond to single-attribute adjustments, enabling progressive refinement of only the attributes requiring further control. This strategy adaptively discovers optimal control orders, ultimately producing summaries that effectively meet all constraints. Extensive experiments across diverse domains and models demonstrate that PACO achieves robust multi-attribute controllability, surpassing both LLM-based self-planning models and fine-tuned baselines. Remarkably, PACO with Llama-3.2-1B rivals the controllability of the much larger Llama-3.3-70B baselines. With larger models, PACO achieves superior control performance, outperforming all competitors.
A Review on Single-Problem Multi-Attempt Heuristic Optimization
Echevarrieta, Judith, Arza, Etor, Pรฉrez, Aritz, Ceberio, Josu
In certain real-world optimization scenarios, practitioners are not interested in solving multiple problems but rather in finding the best solution to a single, specific problem. When the computational budget is large relative to the cost of evaluating a candidate solution, multiple heuristic alternatives can be tried to solve the same given problem, each possibly with a different algorithm, parameter configuration, initialization, or stopping criterion. The sequential selection of which alternative to try next is crucial for efficiently identifying the one that provides the best possible solution across multiple attempts. Despite the relevance of this problem in practice, it has not yet been the exclusive focus of any existing review. Several sequential alternative selection strategies have been proposed in different research topics, but they have not been comprehensively and systematically unified under a common perspective. This work presents a focused review of single-problem multi-attempt heuristic optimization. It brings together suitable strategies to this problem that have been studied separately through algorithm selection, parameter tuning, multi-start and resource allocation. These strategies are explained using a unified terminology within a common framework, which supports the development of a taxonomy for systematically organizing and classifying them.
Alignment-Aware Decoding
Berdoz, Frรฉdรฉric, Lanzendรถrfer, Luca A., Caky, Renรฉ, Wattenhofer, Roger
Alignment of large language models remains a central challenge in natural language processing. Preference optimization has emerged as a popular and effective method for improving alignment, typically through training-time or prompt-based interventions. In this paper, we introduce alignment-aware decoding (AAD), a method to enhance model alignment directly at inference. Theoretically, AAD can be interpreted as implicit reward optimization, yet it requires no specialized training beyond the standard DPO setup. Empirically, AAD consistently outperforms strong baselines across diverse alignment benchmarks and model scales. Moreover, in data-constrained settings, AAD can produce high-quality synthetic data to improve alignment under standard decoding, providing a practical solution when labeled data is limited. Large language models (LLMs) are the backbone of modern natural language processing, powering applications ranging from open-ended dialogue to complex reasoning tasks. Despite their impressive capabilities, aligning these models with human preferences remains a central challenge. Misaligned models can produce harmful, biased, or simply unhelpful outputs, motivating a growing body of work on alignment, i.e., the process of training models to better reflect human values and preferences (Ziegler et al., 2019; Ouyang et al., 2022; Amodei et al., 2016).
RL-Guided Data Selection for Language Model Finetuning
Jha, Animesh, Gupta, Harshit, Nandi, Ananjan
Data selection for finetuning Large Language Models (LLMs) can be framed as a budget-constrained optimization problem: maximizing a model's downstream performance under a strict training data budget. Solving this problem is generally intractable, and existing approximate approaches are pretraining-oriented and transfer poorly to the fine-tuning setting. We reformulate this problem as a tractable Markov Decision Process (MDP) and train agents using various Reinforcement Learning (RL) methods to learn optimal data selection policies, guided by an efficient, proxy-model-based reward signal. Across four datasets, training on a $5\%$ subset selected by our approach matches or outperforms fine-tuning on the full dataset by up to $10.8$ accuracy points, while cutting wall-clock training time by up to $2 \times$, highlighting the promise of RL-guided data selection.
Machine Learning Algorithms for Improving Black Box Optimization Solvers
Kimiaei, Morteza, Kungurtsev, Vyacheslav
Black-box optimization (BBO) addresses problems where objectives are accessible only through costly queries without gradients or explicit structure. Classical derivative-free methods -- line search, direct search, and model-based solvers such as Bayesian optimization -- form the backbone of BBO, yet often struggle in high-dimensional, noisy, or mixed-integer settings. Recent advances use machine learning (ML) and reinforcement learning (RL) to enhance BBO: ML provides expressive surrogates, adaptive updates, meta-learning portfolios, and generative models, while RL enables dynamic operator configuration, robustness, and meta-optimization across tasks. This paper surveys these developments, covering representative algorithms such as NNs with the modular model-based optimization framework (mlrMBO), zeroth-order adaptive momentum methods (ZO-AdaMM), automated BBO (ABBO), distributed block-wise optimization (DiBB), partition-based Bayesian optimization (SPBOpt), the transformer-based optimizer (B2Opt), diffusion-model-based BBO, surrogate-assisted RL for differential evolution (Surr-RLDE), robust BBO (RBO), coordinate-ascent model-based optimization with relative entropy (CAS-MORE), log-barrier stochastic gradient descent (LB-SGD), policy improvement with black-box (PIBB), and offline Q-learning with Mamba backbones (Q-Mamba). We also review benchmark efforts such as the NeurIPS 2020 BBO Challenge and the MetaBox framework. Overall, we highlight how ML and RL transform classical inexact solvers into more scalable, robust, and adaptive frameworks for real-world optimization.