Goto

Collaborating Authors

 Markov Models


On Policy Evaluation Algorithms in Distributional Reinforcement Learning

arXiv.org Machine Learning

We introduce a novel class of algorithms to efficiently approximate the unknown return distributions in policy evaluation problems from distributional reinforcement learning (DRL). The proposed distributional dynamic programming algorithms are suitable for underlying Markov decision processes (MDPs) having an arbitrary probabilistic reward mechanism, including continuous reward distributions with unbounded support being potentially heavy-tailed. For a plain instance of our proposed class of algorithms we prove error bounds, both within Wasserstein and Kolmogorov--Smirnov distances. Furthermore, for return distributions having probability density functions the algorithms yield approximations for these densities; error bounds are given within supremum norm. We introduce the concept of quantile-spline discretizations to come up with algorithms showing promising results in simulation experiments. While the performance of our algorithms can rigorously be analysed they can be seen as universal black box algorithms applicable to a large class of MDPs. We also derive new properties of probability metrics commonly used in DRL on which our quantitative analysis is based.


Optimizing Agricultural Order Fulfillment Systems: A Hybrid Tree Search Approach

arXiv.org Artificial Intelligence

The importance of these seed stocks is underscored by the critical need for timely fulfillment of seed orders to meet specific planting windows, often mandated by the seasonal growth cycles of different crops. Failure to meet these strict timelines can lead to a host of downstream issues, including suboptimal crop yields and financial loss [1]. Figure 1: Overview of the centralized seed fulfillment process. The process begins with the arrival of seed stocks from multiple sites with stochastic, a priori unknown arrival distributions and ends with the fulfillment of orders with different deadlines and quantities. Our proposed adaptive adaptive hybrid tree search approach provides an efficient solution to the wave scheduling problem, optimizing the process of order fulfillment. Order fulfillment in industries such as e-commerce [2] and retail [3] often involve centralized fulfillment centers that simultaneously process arriving inventory and fulfill orders based on their deadlines. The fulfillment process with large catalogs often handle a batch of orders, hereinafter referred to as wave, together using automated sortation systems [4]. The supply chain in these sectors is typically well-established, with known inventory quantities and deterministic restock times. The problem of optimally scheduling waves to maximize fulfillment efficiency is addressed using traditional operations research and optimization techniques [5], [6] as order deadlines and inventory levels are known a priori or can be forecasted with low uncertainty.


Dimension-reduced Reconstruction Map Learning for Parameter Estimation in Likelihood-Free Inference Problems

arXiv.org Machine Learning

Many application areas rely on models that can be readily simulated but lack a closed-form likelihood, or an accurate approximation under arbitrary parameter values. Existing parameter estimation approaches in this setting are generally approximate. Recent work on using neural network models to reconstruct the mapping from the data space to the parameters from a set of synthetic parameter-data pairs suffers from the curse of dimensionality, resulting in inaccurate estimation as the data size grows. We propose a dimension-reduced approach to likelihood-free estimation which combines the ideas of reconstruction map estimation with dimension-reduction approaches based on subject-specific knowledge. We examine the properties of reconstruction map estimation with and without dimension reduction and explore the trade-off between approximation error due to information loss from reducing the data dimension and approximation error. Numerical examples show that the proposed approach compares favorably with reconstruction map estimation, approximate Bayesian computation, and synthetic likelihood estimation.


Track-MDP: Reinforcement Learning for Target Tracking with Controlled Sensing

arXiv.org Machine Learning

State of the art methods for target tracking with sensor management (or controlled sensing) are model-based and are obtained through solutions to Partially Observable Markov Decision Process (POMDP) formulations. In this paper a Reinforcement Learning (RL) approach to the problem is explored for the setting where the motion model for the object/target to be tracked is unknown to the observer. It is assumed that the target dynamics are stationary in time, the state space and the observation space are discrete, and there is complete observability of the location of the target under certain (a priori unknown) sensor control actions. Then, a novel Markov Decision Process (MDP) rather than POMDP formulation is proposed for the tracking problem with controlled sensing, which is termed as Track-MDP. In contrast to the POMDP formulation, the Track-MDP formulation is amenable to an RL based solution. It is shown that the optimal policy for the Track-MDP formulation, which is approximated through RL, is guaranteed to track all significant target paths with certainty. The Track-MDP method is then compared with the optimal POMDP policy, and it is shown that the infinite horizon tracking reward of the optimal Track-MDP policy is the same as that of the optimal POMDP policy. In simulations it is demonstrated that Track-MDP based RL leads to a policy that can track the target with high accuracy.


Optimistic Q-learning for average reward and episodic reinforcement learning

arXiv.org Artificial Intelligence

We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP that for all policies, the expected time to visit some frequent state $s_0$ is finite and upper bounded by $H$. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time {\it for all states} made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A key technical novelty of our work is to introduce an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ where $L$ denotes the Bellman operator. We show that under the given assumption, the $\overline{L}$ operator has a strict contraction (in span) even in the average reward setting. Our algorithm design then uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Therefore, we provide a unified view of regret minimization in episodic and non-episodic settings that may be of independent interest.


Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review

arXiv.org Artificial Intelligence

This tutorial provides a comprehensive survey of methods for fine-tuning diffusion models to optimize downstream reward functions. While diffusion models are widely known to provide excellent generative modeling capability, practical applications in domains such as biology require generating samples that maximize some desired metric (e.g., translation efficiency in RNA, docking score in molecules, stability in protein). In these cases, the diffusion model can be optimized not only to generate realistic samples but also to explicitly maximize the measure of interest. Such methods are based on concepts from reinforcement learning (RL). We explain the application of various RL algorithms, including PPO, differentiable optimization, reward-weighted MLE, value-weighted sampling, and path consistency learning, tailored specifically for fine-tuning diffusion models. We aim to explore fundamental aspects such as the strengths and limitations of different RL-based fine-tuning algorithms across various scenarios, the benefits of RL-based fine-tuning compared to non-RL-based approaches, and the formal objectives of RL-based fine-tuning (target distributions). Additionally, we aim to examine their connections with related topics such as classifier guidance, Gflownets, flow-based diffusion models, path integral control theory, and sampling from unnormalized distributions such as MCMC. The code of this tutorial is available at https://github.com/masa-ue/RLfinetuning_Diffusion_Bioseq


Geometric Active Exploration in Markov Decision Processes: the Benefit of Abstraction

arXiv.org Artificial Intelligence

How can a scientist use a Reinforcement Learning (RL) algorithm to design experiments over a dynamical system's state space? In the case of finite and Markovian systems, an area called Active Exploration (AE) relaxes the optimization problem of experiments design into Convex RL, a generalization of RL admitting a wider notion of reward. Unfortunately, this framework is currently not scalable and the potential of AE is hindered by the vastness of experiment spaces typical of scientific discovery applications. However, these spaces are often endowed with natural geometries, e.g., permutation invariance in molecular design, that an agent could leverage to improve the statistical and computational efficiency of AE. To achieve this, we bridge AE and MDP homomorphisms, which offer a way to exploit known geometric structures via abstraction. Towards this goal, we make two fundamental contributions: we extend MDP homomorphisms formalism to Convex RL, and we present, to the best of our knowledge, the first analysis that formally captures the benefit of abstraction via homomorphisms on sample efficiency. Ultimately, we propose the Geometric Active Exploration (GAE) algorithm, which we analyse theoretically and experimentally in environments motivated by problems in scientific discovery.


Scalable Monte Carlo for Bayesian Learning

arXiv.org Machine Learning

This book aims to provide a graduate-level introduction to advanced topics in Markov chain Monte Carlo (MCMC) algorithms, as applied broadly in the Bayesian computational context. Most, if not all of these topics (stochastic gradient MCMC, non-reversible MCMC, continuous time MCMC, and new techniques for convergence assessment) have emerged as recently as the last decade, and have driven substantial recent practical and theoretical advances in the field. A particular focus is on methods that are scalable with respect to either the amount of data, or the data dimension, motivated by the emerging high-priority application areas in machine learning and AI.


Driving pattern interpretation based on action phases clustering

arXiv.org Artificial Intelligence

Current approaches to identifying driving heterogeneity face challenges in comprehending fundamental patterns from the perspective of underlying driving behavior mechanisms. The concept of Action phases was proposed in our previous work, capturing the diversity of driving characteristics with physical meanings. This study presents a novel framework to further interpret driving patterns by classifying Action phases in an unsupervised manner. In this framework, a Resampling and Downsampling Method (RDM) is first applied to standardize the length of Action phases. Then the clustering calibration procedure including ''Feature Selection'', ''Clustering Analysis'', ''Difference/Similarity Evaluation'', and ''Action phases Re-extraction'' is iteratively applied until all differences among clusters and similarities within clusters reach the pre-determined criteria. Application of the framework using real-world datasets revealed six driving patterns in the I80 dataset, labeled as ''Catch up'', ''Keep away'', and ''Maintain distance'', with both ''Stable'' and ''Unstable'' states. Notably, Unstable patterns are more numerous than Stable ones. ''Maintain distance'' is the most common among Stable patterns. These observations align with the dynamic nature of driving. Two patterns ''Stable keep away'' and ''Unstable catch up'' are missing in the US101 dataset, which is in line with our expectations as this dataset was previously shown to have less heterogeneity. This demonstrates the potential of driving patterns in describing driving heterogeneity. The proposed framework promises advantages in addressing label scarcity in supervised learning and enhancing tasks such as driving behavior modeling and driving trajectory prediction.


Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms

arXiv.org Artificial Intelligence

Reinforcement Learning (RL) serves as a versatile framework for sequential decision-making, finding applications across diverse domains such as robotics, autonomous driving, recommendation systems, supply chain optimization, biology, mechanics, and finance. The primary objective in these applications is to maximize the average reward. Real-world scenarios often necessitate adherence to specific constraints during the learning process. This monograph focuses on the exploration of various model-based and model-free approaches for Constrained RL within the context of average reward Markov Decision Processes (MDPs). The investigation commences with an examination of model-based strategies, delving into two foundational methods - optimism in the face of uncertainty and posterior sampling. Subsequently, the discussion transitions to parametrized model-free approaches, where the primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs. The monograph provides regret guarantees and analyzes constraint violation for each of the discussed setups. For the above exploration, we assume the underlying MDP to be ergodic. Further, this monograph extends its discussion to encompass results tailored for weakly communicating MDPs, thereby broadening the scope of its findings and their relevance to a wider range of practical scenarios.