Markov Models
Doing More with Less -- Implementing Routing Strategies in Large Language Model-Based Systems: An Extended Survey
Varangot-Reille, Clovis, Bouvard, Christophe, Gourru, Antoine, Ciancone, Mathieu, Schaeffer, Marion, Jacquenet, François
Large Language Models (LLM)-based systems, i.e. interconnected elements that include an LLM as a central component (e.g., conversational agents), are typically monolithic static architectures that rely on a single LLM for all user queries. However, they often require different preprocessing strategies, levels of reasoning, or knowledge. Generalist LLMs (e.g. GPT-4) trained on very large multi-topic corpora can perform well in a variety of tasks. They require significant financial, energy, and hardware resources that may not be justified for basic tasks. This implies potentially investing in unnecessary costs for a given query. To overcome this problem, a routing mechanism routes user queries to the most suitable components, such as smaller LLMs or experts in specific topics. This approach may improve response quality while minimising costs. Routing can be expanded to other components of the conversational agent architecture, such as the selection of optimal embedding strategies. This paper explores key considerations for integrating routing into LLM-based systems, focusing on resource management, cost definition, and strategy selection. Our main contributions include a formalisation of the problem, a novel taxonomy of existing approaches emphasising relevance and resource efficiency, and a comparative analysis of these strategies in relation to industry practices. Finally, we identify critical challenges and directions for future research.
Synthesis of Model Predictive Control and Reinforcement Learning: Survey and Classification
Reiter, Rudolf, Hoffmann, Jasper, Reinhardt, Dirk, Messerer, Florian, Baumgärtner, Katrin, Sawant, Shamburaj, Boedecker, Joschka, Diehl, Moritz, Gros, Sebastien
The fields of MPC and RL consider two successful control techniques for Markov decision processes. Both approaches are derived from similar fundamental principles, and both are widely used in practical applications, including robotics, process control, energy systems, and autonomous driving. Despite their similarities, MPC and RL follow distinct paradigms that emerged from diverse communities and different requirements. Various technical discrepancies, particularly the role of an environment model as part of the algorithm, lead to methodologies with nearly complementary advantages. Due to their orthogonal benefits, research interest in combination methods has recently increased significantly, leading to a large and growing set of complex ideas leveraging MPC and RL. This work illuminates the differences, similarities, and fundamentals that allow for different combination algorithms and categorizes existing work accordingly. Particularly, we focus on the versatile actor-critic RL approach as a basis for our categorization and examine how the online optimization approach of MPC can be used to improve the overall closed-loop performance of a policy.
FRAUD-RLA: A new reinforcement learning adversarial attack against credit card fraud detection
Lunghi, Daniele, Molinghen, Yannick, Simitsis, Alkis, Lenaerts, Tom, Bontempi, Gianluca
The main works [10, 11] attack the same realistic fraud detection Adversarial attacks pose a significant threat to data-driven engine called BankSealer [9]. In both works, the authors systems, and researchers have spent considerable resources rightfully consider domain-specific challenges generally absent studying them. Despite its economic relevance, this trend in other adversarial works, such as the intricate feature largely overlooked the issue of credit card fraud detection. To engineering process performed in fraud detection. However, address this gap, we propose a new threat model that demonstrates they operate under the assumption that fraudsters can access the limitations of existing attacks and highlights the the customers' transaction history. As the authors point out, necessity to investigate new approaches. We then design a this may be achieved through the introduction of malware into new adversarial attack for credit card fraud detection, employing the victim's devices. However, this considerably increases the reinforcement learning to bypass classifiers. This attack, difficulty of performing any attack, as fraudsters must first called FRAUD-RLA, is designed to maximize the attacker's compromise the customer's device and observe past transaction reward by optimizing the exploration-exploitation tradeoff history, which constitutes a significantly more complex and working with significantly less required knowledge than undertaking than stealing or cloning a card.
Online Hybrid-Belief POMDP with Coupled Semantic-Geometric Models and Semantic Safety Awareness
Lemberg, Tuvy, Indelman, Vadim
Robots operating in complex and unknown environments frequently require geometric-semantic representations of the environment to safely perform their tasks. While inferring the environment, they must account for many possible scenarios when planning future actions. Since objects' class types are discrete and the robot's self-pose and the objects' poses are continuous, the environment can be represented by a hybrid discrete-continuous belief which is updated according to models and incoming data. Prior probabilities and observation models representing the environment can be learned from data using deep learning algorithms. Such models often couple environmental semantic and geometric properties. As a result, semantic variables are interconnected, causing semantic state space dimensionality to increase exponentially. In this paper, we consider planning under uncertainty using partially observable Markov decision processes (POMDPs) with hybrid semantic-geometric beliefs. The models and priors consider the coupling between semantic and geometric variables. Within POMDP, we introduce the concept of semantically aware safety. Obtaining representative samples of the theoretical hybrid belief, required for estimating the value function, is very challenging. As a key contribution, we develop a novel form of the hybrid belief and leverage it to sample representative samples. We show that under certain conditions, the value function and probability of safety can be calculated efficiently with an explicit expectation over all possible semantic mappings. Our simulations show that our estimates of the objective function and probability of safety achieve similar levels of accuracy compared to estimators that run exhaustively on the entire semantic state-space using samples from the theoretical hybrid belief. Nevertheless, the complexity of our estimators is polynomial rather than exponential.
Anytime Incremental $\rho$POMDP Planning in Continuous Spaces
Benchetrit, Ron, Lev-Yehudi, Idan, Zhitnikov, Andrey, Indelman, Vadim
Partially Observable Markov Decision Processes (POMDPs) provide a robust framework for decision-making under uncertainty in applications such as autonomous driving and robotic exploration. Their extension, $\rho$POMDPs, introduces belief-dependent rewards, enabling explicit reasoning about uncertainty. Existing online $\rho$POMDP solvers for continuous spaces rely on fixed belief representations, limiting adaptability and refinement - critical for tasks such as information-gathering. We present $\rho$POMCPOW, an anytime solver that dynamically refines belief representations, with formal guarantees of improvement over time. To mitigate the high computational cost of updating belief-dependent rewards, we propose a novel incremental computation approach. We demonstrate its effectiveness for common entropy estimators, reducing computational cost by orders of magnitude. Experimental results show that $\rho$POMCPOW outperforms state-of-the-art solvers in both efficiency and solution quality.
Reinforcement Learning for Long-Horizon Interactive LLM Agents
Chen, Kevin, Cusumano-Towner, Marco, Huval, Brody, Petrenko, Aleksei, Hamburger, Jackson, Koltun, Vladlen, Krähenbühl, Philipp
Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests. While IDAs powered by instruction-tuned large language models (LLMs) can react to feedback from interface invocations in multi-step exchanges, they have not been trained in their respective digital environments. Prior methods accomplish less than half of tasks in sophisticated benchmarks such as AppWorld. We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments. We formalize this training as a partially observable Markov decision process and derive LOOP, a data- and memory-efficient variant of proximal policy optimization. LOOP uses no value network and maintains exactly one copy of the underlying LLM in memory, making its implementation straightforward and as memory-efficient as fine-tuning a single LLM. A 32-billion-parameter agent trained with LOOP in the AppWorld environment outperforms the much larger OpenAI o1 agent by 9 percentage points (15% relative). To our knowledge, this is the first reported application of RL to IDAs that interact with a stateful, multi-domain, multi-app environment via direct API calls. Our analysis sheds light on the effectiveness of RL in this area, showing that the agent learns to consult the API documentation, avoid unwarranted assumptions, minimize confabulation, and recover from setbacks.
OmniRL: In-Context Reinforcement Learning by Large-Scale Meta-Training in Randomized Worlds
Wang, Fan, Shao, Pengtao, Zhang, Yiming, Yu, Bo, Liu, Shaoshan, Ding, Ning, Cao, Yang, Kang, Yu, Wang, Haifeng
We introduce OmniRL, a highly generalizable in-context reinforcement learning (ICRL) model that is meta-trained on hundreds of thousands of diverse tasks. These tasks are procedurally generated by randomizing state transitions and rewards within Markov Decision Processes. To facilitate this extensive meta-training, we propose two key innovations: 1. An efficient data synthesis pipeline for ICRL, which leverages the interaction histories of diverse behavior policies; and 2. A novel modeling framework that integrates both imitation learning and reinforcement learning (RL) within the context, by incorporating prior knowledge. For the first time, we demonstrate that in-context learning (ICL) alone, without any gradient-based fine-tuning, can successfully tackle unseen Gymnasium tasks through imitation learning, online RL, or offline RL. Additionally, we show that achieving generalized ICRL capabilities-unlike task identification-oriented few-shot learning-critically depends on long trajectories generated by variant tasks and diverse behavior policies. By emphasizing the potential of ICL and departing from pre-training focused on acquiring specific skills, we further underscore the significance of meta-training aimed at cultivating the ability of ICL itself.
Gap-Dependent Bounds for Federated $Q$-learning
Zhang, Haochen, Zheng, Zhong, Xue, Lingzhou
We present the first gap-dependent analysis of regret and communication cost for on-policy federated $Q$-Learning in tabular episodic finite-horizon Markov decision processes (MDPs). Existing FRL methods focus on worst-case scenarios, leading to $\sqrt{T}$-type regret bounds and communication cost bounds with a $\log T$ term scaling with the number of agents $M$, states $S$, and actions $A$, where $T$ is the average total number of steps per agent. In contrast, our novel framework leverages the benign structures of MDPs, such as a strictly positive suboptimality gap, to achieve a $\log T$-type regret bound and a refined communication cost bound that disentangles exploration and exploitation. Our gap-dependent regret bound reveals a distinct multi-agent speedup pattern, and our gap-dependent communication cost bound removes the dependence on $MSA$ from the $\log T$ term. Notably, our gap-dependent communication cost bound also yields a better global switching cost when $M=1$, removing $SA$ from the $\log T$ term.
A weak convergence approach to large deviations for stochastic approximations
Hult, Henrik, Lindhe, Adam, Nyquist, Pierre, Wu, Guo-Jhen
Stochastic approximation (SA) algorithms, first introduced by Robbins and Monro in the 1950's [24], has become one of the most important classes of stochastic numerical methods. Originally aimed at finding the the root of a continuous function given noisy observations, SA is now a fundamental tool in a range of areas such as statistics, optimization, electrical engineering, and machine learning, to mention but a few. Within the latter, the importance of SA algorithms is illustrated by the fact that a specific subclass of methods--stochastic gradient descent (SGD) methods--is central to the training of deep learning methods, and in reinforcement learning the standard methods (Q-learning and temporal-difference-learning) are variants of SA. The general class that is SA algorithms with state-dependent noise (see below for the definition) therefore constitute a rich and important family of stochastic recursive algorithms. In addition to the examples already mentioned (SGD, reinforcement learning), this class also includes persistent contrastive divergence, adaptive Markov chain Monte-Carlo (MCMC) and extended ensemble algorithms such as the Wang-Landau algorithm. The theory of SA stems from the pioneering work of Robbins and Monro [24] and Kiefer and Wolfowitz [20], and remains an active research area within probability theory. This is in part due to the many and diverse applications of SA algorithms where, due to the complex nature of the systems under considerations, different variants of the original Robbins-Monro algorithm are needed. In turn, developing the theoretical foundation for SA algorithms, such as, e.g., convergence results, central limit theorems, concentration results and results on deviations, is of fundamental importance; monographs covering many of the standard results of
Toward Task Generalization via Memory Augmentation in Meta-Reinforcement Learning
Bao, Kaixi, Li, Chenhao, As, Yarden, Krause, Andreas, Hutter, Marco
In reinforcement learning (RL), agents often struggle to perform well on tasks that differ from those encountered during training. This limitation presents a challenge to the broader deployment of RL in diverse and dynamic task settings. In this work, we introduce memory augmentation, a memory-based RL approach to improve task generalization. Our approach leverages task-structured augmentations to simulate plausible out-of-distribution scenarios and incorporates memory mechanisms to enable context-aware policy adaptation. Trained on a predefined set of tasks, our policy demonstrates the ability to generalize to unseen tasks through memory augmentation without requiring additional interactions with the environment. Through extensive simulation experiments and real-world hardware evaluations on legged locomotion tasks, we demonstrate that our approach achieves zero-shot generalization to unseen tasks while maintaining robust in-distribution performance and high sample efficiency.