 Markov Models


Universal Post-Processing Networks for Joint Optimization of Modules in Task-Oriented Dialogue Systems

arXiv.org Artificial Intelligence

Post-processing networks (PPNs) are components that modify the outputs of arbitrary modules in task-oriented dialogue systems and are optimized using reinforcement learning (RL) to improve the overall task completion capability of the system. However, previous PPN-based approaches have been limited to handling only a subset of modules within a system, which poses a significant limitation in improving the system performance. In this study, we propose a joint optimization method for post-processing the outputs of all modules using universal post-processing networks (UniPPNs), which are language-model-based networks that can modify the outputs of arbitrary modules in a system as a sequence-transformation task. Moreover, our RL algorithm, which employs a module-level Markov decision process, enables fine-grained value and advantage estimation for each module, thereby stabilizing joint learning for post-processing the outputs of all modules. Through both simulation-based and human evaluation experiments using the MultiWOZ dataset, we demonstrated that UniPPN outperforms conventional PPNs in the task completion capability of task-oriented dialogue systems.


Enhancing Memory and Imagination Consistency in Diffusion-based World Models via Linear-Time Sequence Modeling

arXiv.org Artificial Intelligence

World models are crucial for enabling agents to simulate and plan within environments, yet existing approaches struggle with long-term dependencies and inconsistent predictions. We introduce EDELINE, a novel framework that integrates diffusion models with linear-time state space models to enhance memory retention and temporal consistency. EDELINE employs a recurrent embedding module based on Mamba SSMs for processing unbounded sequences, a unified architecture for joint reward and termination prediction, and dynamic loss harmonization to balance multi-task learning. Our results across multiple benchmarks demonstrate EDELINE's superiority and robustness over prior baselines in long-horizon tasks.


Transition Transfer $Q$-Learning for Composite Markov Decision Processes

arXiv.org Machine Learning

To bridge the gap between empirical success and theoretical understanding in transfer reinforcement learning (RL), we study a principled approach with provable performance guarantees. We introduce a novel composite MDP framework where high-dimensional transition dynamics are modeled as the sum of a low-rank component representing shared structure and a sparse component capturing task-specific variations. This relaxes the common assumption of purely low-rank transition models, allowing for more realistic scenarios where tasks share core dynamics but maintain individual variations. We introduce UCB-TQL (Upper Confidence Bound Transfer Q-Learning), designed for transfer RL scenarios where multiple tasks share core linear MDP dynamics but diverge along sparse dimensions. When applying UCB-TQL to a target task after training on a source task with sufficient trajectories, we achieve a regret bound of $\tilde{O}(\sqrt{eH^5N})$ that scales independently of the ambient dimension. Here, $N$ represents the number of trajectories in the target task, while $e$ quantifies the sparse differences between tasks. This result demonstrates substantial improvement over single-task RL by effectively leveraging the structural similarities between tasks. Our theoretical analysis provides rigorous guarantees for how UCB-TQL simultaneously exploits shared dynamics while adapting to task-specific variations.
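As a rough illustration of the "low-rank plus sparse" structure described in this abstract, the sketch below builds two toy matrices that share a rank-1 component and differ only in a few sparse entries. Everything here is an assumption made for illustration: the names (`outer`, `low_rank`, `sparse_source`, `sparse_target`), the 4x4 size, and the numeric values are hypothetical, the matrices are abstract "core" matrices rather than normalized transition probabilities, and this is not the paper's UCB-TQL algorithm.

```python
# Toy sketch (hypothetical values): two tasks whose "core" matrices share a
# low-rank part and differ only through a sparse, task-specific part.

def outer(u, v):
    """Outer product of two vectors, giving a rank-1 matrix."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def count_nonzero(m):
    """Number of entries that are not exactly zero."""
    return sum(1 for row in m for x in row if x != 0.0)

# Shared structure: a rank-1 matrix (assumed for illustration).
low_rank = outer([1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.125, 0.0625])

# Task-specific variations: one nonzero entry per task (assumed).
sparse_source = [[0.0] * 4 for _ in range(4)]
sparse_target = [[0.0] * 4 for _ in range(4)]
sparse_source[0][3] = 0.1
sparse_target[2][1] = -0.2

core_source = add(low_rank, sparse_source)
core_target = add(low_rank, sparse_target)

# The two tasks differ only where the sparse components differ.
difference = [[s - t for s, t in zip(row_s, row_t)]
              for row_s, row_t in zip(core_source, core_target)]
num_differences = count_nonzero(difference)
print(num_differences)
```

In this toy setting, the count of differing entries plays the role that the abstract assigns to $e$: it measures how far the target task departs from the source task, independently of the matrix size.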


Safety Alignment Depth in Large Language Models: A Markov Chain Perspective

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly adopted in high-stakes scenarios, yet their safety mechanisms often remain fragile. Simple jailbreak prompts or even benign fine-tuning can bypass these protocols, underscoring the need to understand where and how they fail. Recent findings suggest that vulnerabilities emerge when alignment is confined to only the initial output tokens. Unfortunately, even with the introduction of deep safety alignment, determining the optimal safety depth remains an unresolved challenge. By leveraging the equivalence between autoregressive language models and Markov chains, this paper offers the first theoretical result on how to identify the ideal depth for safety alignment, and demonstrates how permutation-based data augmentation can tighten these bounds. Crucially, we reveal a fundamental interaction between alignment depth and ensemble width, indicating that broader ensembles can compensate for shallower alignments. These insights provide a theoretical foundation for designing more robust, scalable safety strategies that complement existing alignment approaches, opening new avenues for research into safer, more reliable LLMs.


Functional role of synchronization: A mean-field control perspective

arXiv.org Machine Learning

Our friend and mentor Peter Caines has, together with his colleagues, created new foundations for studying collective dynamics in complex systems. Of particular inspiration to us has been his pioneering work in mean-field games (MFGs) launched two decades ago [10, 24, 25], and the related field of mean-field control. Peter pointed the way to both formulate and solve the problem of collective dynamics arising in a large population of heterogeneous dynamical systems. In this paper we survey some elements of MFGs within the context of controlled coupled oscillators. We begin by introducing a model for a single oscillator: $\mathrm{d}\theta(t) = (\omega + u(t))\,\mathrm{d}t + \sigma\,\mathrm{d}\xi(t), \ \text{mod } 2\pi$ (1), where $\theta(t) \in [0, 2\pi)$ is the phase of the oscillator at time $t$, $\omega$ is the nominal frequency with units of radians per second, $\{\xi(t) : t \geq 0\}$ is a standard Wiener process, and $u(t)$ is a control signal whose interpretation depends on the context. Unless otherwise noted, the SDEs are interpreted in their Itô form.
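The single-oscillator model (1) can be simulated numerically. The following is a minimal sketch using the Euler-Maruyama scheme, which is consistent with the Itô interpretation stated above. The function name `simulate_oscillator`, the step size, the parameter values, and the choice to let the control depend on both time and phase are assumptions made for illustration; they do not come from the paper.

```python
import math
import random

def simulate_oscillator(omega, sigma, control, dt=0.01, steps=1000,
                        theta0=0.0, seed=0):
    """Euler-Maruyama simulation of d(theta) = (omega + u) dt + sigma d(xi), mod 2*pi.

    `control` is a hypothetical callable u(t, theta); the source only writes u(t).
    """
    rng = random.Random(seed)
    two_pi = 2.0 * math.pi
    theta = theta0 % two_pi
    path = [theta]
    for k in range(steps):
        t = k * dt
        # A Wiener increment over a step of length dt is Gaussian with variance dt.
        d_xi = rng.gauss(0.0, math.sqrt(dt))
        theta = (theta + (omega + control(t, theta)) * dt + sigma * d_xi) % two_pi
        path.append(theta)
    return path

# Uncontrolled oscillator (u = 0) with a small amount of noise.
noisy_path = simulate_oscillator(omega=1.0, sigma=0.1,
                                 control=lambda t, theta: 0.0)

# Noise-free check: with sigma = 0 the phase advances by omega * dt per step.
clean_path = simulate_oscillator(omega=1.0, sigma=0.0,
                                 control=lambda t, theta: 0.0, steps=100)
print(len(noisy_path), clean_path[-1])
```

With `sigma=0` and `u=0` the phase after 100 steps of size 0.01 is approximately `omega * 1.0`, which gives a quick sanity check on the drift term before adding noise or a nontrivial control.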


Reinforcement Learning on Reconfigurable Hardware: Overcoming Material Variability in Laser Material Processing

arXiv.org Artificial Intelligence

Ensuring consistent processing quality is challenging in laser processes due to varying material properties and surface conditions. Although some approaches have shown promise in solving this problem via automation, they often rely on predetermined targets or are limited to simulated environments. To address these shortcomings, we propose a novel real-time reinforcement learning approach for laser process control, implemented on a Field Programmable Gate Array to achieve real-time execution. Our experimental results from laser welding tests on stainless steel samples with a range of surface roughnesses validated the method's ability to adapt autonomously, without relying on reward engineering or prior setup information. Specifically, the algorithm learned the correct power profile for each unique surface characteristic, demonstrating significant improvements over hand-engineered optimal constant power strategies -- up to 23% better performance on rougher surfaces and 7% on mixed surfaces. This approach represents a significant advancement in automating and optimizing laser processes, with potential applications across multiple industries.


A Theoretical Justification for Asymmetric Actor-Critic Algorithms

arXiv.org Machine Learning

In reinforcement learning for partially observable environments, many successful algorithms were developed within the asymmetric learning paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates an error term arising from aliasing in the agent state.


adabmDCA 2.0 -- a flexible but easy-to-use package for Direct Coupling Analysis

arXiv.org Artificial Intelligence

In this methods article, we provide a flexible but easy-to-use implementation of Direct Coupling Analysis (DCA) based on Boltzmann machine learning, together with a tutorial on how to use it. The package \texttt{adabmDCA 2.0} is available in different programming languages (C++, Julia, Python) usable on different architectures (single-core and multi-core CPU, GPU) using a common front-end interface. In addition to several learning protocols for dense and sparse generative DCA models, it directly addresses common downstream tasks like residue-residue contact prediction, mutational-effect prediction, scoring of sequence libraries and generation of artificial sequences for sequence design. It is readily applicable to protein and RNA sequence data.


Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models

arXiv.org Machine Learning

We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has been previously addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies having a minimum probability of choosing each action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions about the consistency of employed estimators. Our work removes these limitations by proving convenient estimation guarantees for the transition model and introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We introduce modifications ...

Reinforcement Learning (RL) (Sutton and Barto, 2018) tackles the sequential decision-making problem of an agent interacting with an unknown or partially known environment with the goal of maximizing the long-term sum of rewards. The RL agent should trade off between exploring the environment to learn its structure and exploiting the estimates to compute a policy that maximizes the reward. This problem has been successfully addressed in past works under the MDP formulation (Bartlett and Tewari, 2009; Jaksch et al., 2010; Zanette and Brunskill, 2019). MDPs assume full observability of the state space but this assumption is often violated in many real-world scenarios such as robotics or finance, where only a partial observation of the environment is available. In this case, it is more appropriate to model the problem using Partially-Observable MDPs (Sondik, 1978).
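Belief-based policies act on a posterior distribution over hidden states, which is maintained with the standard Bayes filter. The sketch below shows that update for a small discrete POMDP, assuming a known observation model and some estimate of the transition model. The function name `belief_update`, the array layouts, and all numeric values are hypothetical and chosen only for illustration; this is the textbook belief update, not the paper's estimation procedure or optimistic algorithm.

```python
def belief_update(belief, action, observation, transition, observation_model):
    """One step of the discrete Bayes filter.

    Assumed layouts (hypothetical):
      transition[a][s][s2]      = P(next state s2 | state s, action a)
      observation_model[s2][o]  = P(observation o | state s2)
    """
    n = len(belief)
    # Prediction step: push the current belief through the transition model.
    predicted = [sum(belief[s] * transition[action][s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correction step: reweight by the likelihood of the received observation.
    unnormalized = [observation_model[s2][observation] * predicted[s2]
                    for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [x / total for x in unnormalized]

# Two hidden states, one action, two observations (all values assumed).
transition = [[[0.9, 0.1],
               [0.2, 0.8]]]
observation_model = [[0.8, 0.2],
                     [0.3, 0.7]]

new_belief = belief_update([0.5, 0.5], action=0, observation=1,
                           transition=transition,
                           observation_model=observation_model)
print(new_belief)
```

In the setting of the abstract, `observation_model` would be given while `transition` would be replaced by an estimate learned from interaction, so the quality of the belief depends on how well the transition model is estimated.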


Deceptive Sequential Decision-Making via Regularized Policy Optimization

arXiv.org Artificial Intelligence

Autonomous systems are increasingly expected to operate in the presence of adversaries, though an adversary may infer sensitive information simply by observing a system, without even needing to interact with it. Therefore, in this work we present a deceptive decision-making framework that not only conceals sensitive information, but in fact actively misleads adversaries about it. We model autonomous systems as Markov decision processes, and we consider adversaries that attempt to infer their reward functions using inverse reinforcement learning. To counter such efforts, we present two regularization strategies for policy synthesis problems that actively deceive an adversary about a system's underlying rewards. The first form of deception is ``diversionary'', and it leads an adversary to draw any false conclusion about what the system's reward function is. The second form of deception is ``targeted'', and it leads an adversary to draw a specific false conclusion about what the system's reward function is. We then show how each form of deception can be implemented in policy optimization problems, and we analytically bound the loss in total accumulated reward that is induced by deception. Next, we evaluate these developments in a multi-agent sequential decision-making problem with one real agent and multiple decoys. We show that diversionary deception can cause the adversary to believe that the most important agent is the least important, while attaining a total accumulated reward that is $98.83\%$ of its optimal, non-deceptive value. Similarly, we show that targeted deception can make any decoy appear to be the most important agent, while still attaining a total accumulated reward that is $99.25\%$ of its optimal, non-deceptive value.