Markov Models
Syntax-Guided Procedural Synthesis of Molecules
Sun, Michael, Lo, Alston, Gao, Wenhao, Guo, Minghao, Thost, Veronika, Chen, Jie, Coley, Connor, Matusik, Wojciech
Designing synthetically accessible molecules and recommending analogs to unsynthesizable molecules are important problems for accelerating molecular discovery. We reconceptualize both problems using ideas from program synthesis. Drawing inspiration from syntax-guided synthesis approaches, we decouple the syntactic skeleton from the semantics of a synthetic tree to create a bilevel framework for reasoning about the combinatorial space of synthesis pathways. Given a molecule we aim to generate analogs for, we iteratively refine its skeletal characteristics via Markov Chain Monte Carlo simulations over the space of syntactic skeletons. Given a black-box oracle to optimize, we formulate a joint design space over syntactic templates and molecular descriptors and introduce evolutionary algorithms that optimize both syntactic and semantic dimensions synergistically. Our key insight is that once the syntactic skeleton is set, we can amortize over the search complexity of deriving the program's semantics by training policies to fully utilize the fixed horizon Markov Decision Process imposed by the syntactic template. We demonstrate performance advantages of our bilevel framework for synthesizable analog generation and synthesizable molecule design. Notably, our approach offers the user explicit control over the resources required to perform synthesis and biases the design space towards simpler solutions, making it particularly promising for autonomous synthesis platforms.
Tree-structured Markov random fields with Poisson marginal distributions
Côté, Benjamin, Cossette, Hélène, Marceau, Etienne
Having graphs underlie multivariate distributions performs a translation of their vast range of topologies to a rich variety of dependence schemes; this is the premise of probabilistic graphical models. The graph, through its edges and its vertices, serves as a representation of the dependence relations knitting the random variables to one another. Among others, [Koller and Friedman, 2009] and [Maathuis et al., 2018] dive deeply into probabilistic graphical models, with much emphasis on Bayesian networks and Markov random fields (MRFs), also called Markov networks or undirected graphical models. In [Besag, 1974], a seminal paper on MRFs, the author defines various families through their conditional distributions: the distribution of a random variable is specified given the value taken by its neighbours according to the underlying graph. A family that has drawn particular attention to represent a vector of count random variables is Besag's auto-Poisson MRFs, also called Poisson graphical models, where each vertex's conditional distribution is Poisson, the values taken by the neighbouring vertices' random variables influencing its mean parameter.
Rethinking State Disentanglement in Causal Reinforcement Learning
Cao, Haiyao, Zhang, Zhen, Cai, Panpan, Liu, Yuhang, Zou, Jinan, Abbasnejad, Ehsan, Huang, Biwei, Gong, Mingming, Hengel, Anton van den, Shi, Javen Qinfeng
One of the significant challenges in reinforcement learning (RL) when dealing with noise is estimating latent states from observations. Causality provides rigorous theoretical support for ensuring that the underlying states can be uniquely recovered through identifiability. Consequently, some existing work focuses on establishing identifiability from a causal perspective to aid in the design of algorithms. However, these results are often derived from a purely causal viewpoint, which may overlook the specific RL context. We revisit this research line and find that incorporating RL-specific context can reduce unnecessary assumptions in previous identifiability analyses for latent states. More importantly, removing these assumptions allows algorithm design to go beyond the earlier boundaries constrained by them. Leveraging these insights, we propose a novel approach for general partially observable Markov Decision Processes (POMDPs) by replacing the complicated structural constraints in previous methods with two simple constraints for transition and reward preservation. With the two constraints, the proposed algorithm is guaranteed to disentangle state and noise that is faithful to the underlying dynamics. Empirical evidence from extensive benchmark control tasks demonstrates the superiority of our approach over existing counterparts in effectively disentangling state belief from noise.
Mastering the Digital Art of War: Developing Intelligent Combat Simulation Agents for Wargaming Using Hierarchical Reinforcement Learning
In today's rapidly evolving military landscape, advancing artificial intelligence (AI) in support of wargaming becomes essential. Despite reinforcement learning (RL) showing promise for developing intelligent agents, conventional RL faces limitations in handling the complexity inherent in combat simulations. This dissertation proposes a comprehensive approach, including targeted observation abstractions, multi-model integration, a hybrid AI framework, and an overarching hierarchical reinforcement learning (HRL) framework. Our localized observation abstraction using piecewise linear spatial decay simplifies the RL problem, enhancing computational efficiency and demonstrating superior efficacy over traditional global observation methods. Our multi-model framework combines various AI methodologies, optimizing performance while still enabling the use of diverse, specialized individual behavior models. Our hybrid AI framework synergizes RL with scripted agents, leveraging RL for high-level decisions and scripted agents for lower-level tasks, enhancing adaptability, reliability, and performance. Our HRL architecture and training framework decomposes complex problems into manageable subproblems, aligning with military decision-making structures. Although initial tests did not show improved performance, insights were gained to improve future iterations. This study underscores AI's potential to revolutionize wargaming, emphasizing the need for continued research in this domain.
Reduce, Reuse, Recycle: Categories for Compositional Reinforcement Learning
Bakirtzis, Georgios, Savvas, Michail, Zhao, Ruihan, Chinchali, Sandeep, Topcu, Ufuk
In reinforcement learning, conducting task composition by forming cohesive, executable sequences from multiple tasks remains challenging. However, the ability to (de)compose tasks is a linchpin in developing robotic systems capable of learning complex behaviors. Yet, compositional reinforcement learning is beset with difficulties, including the high dimensionality of the problem space, scarcity of rewards, and absence of system robustness after task composition. To surmount these challenges, we view task composition through the prism of category theory -- a mathematical discipline exploring structures and their compositional relationships. The categorical properties of Markov decision processes untangle complex tasks into manageable sub-tasks, allowing for strategical reduction of dimensionality, facilitating more tractable reward structures, and bolstering system robustness. Experimental results support the categorical theory of reinforcement learning by enabling skill reduction, reuse, and recycling when learning complex robotic arm tasks.
A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning
Zeng, Sihan, Doan, Thinh T., Romberg, Justin
We study a new two-time-scale stochastic gradient method for solving optimization problems, where the gradients are computed with the aid of an auxiliary variable under samples generated by time-varying MDPs controlled by the underlying optimization variable. These time-varying samples make gradient directions in our update biased and dependent, which can potentially lead to the divergence of the iterates. In our two-time-scale approach, one scale is to estimate the true gradient from these samples, which is then used to update the estimate of the optimal solution. While these two iterates are implemented simultaneously, the former is updated "faster" than the latter. Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for the convergence rates of this method under different structural assumptions, namely, strong convexity, PL condition, and general non-convexity. We apply our framework to various policy optimization problems. First, we look at the infinite-horizon average-reward MDP with finite state and action spaces and derive a convergence rate of $O(k^{-2/5})$ for the online actor-critic algorithm under function approximation, which recovers the best known rate derived specifically for this problem. Second, we study the linear-quadratic regulator and show that an online actor-critic method converges with rate $O(k^{-2/3})$. Third, we use the actor-critic algorithm to solve the policy optimization problem in an entropy regularized Markov decision process, where we also establish a convergence of $O(k^{-2/3})$. The results we derive for both the second and third problem are novel and previously unknown in the literature. Finally, we briefly present the application of our framework to gradient-based policy evaluation algorithms in reinforcement learning.
Accelerated Markov Chain Monte Carlo Using Adaptive Weighting Scheme
Wang, Yanbo, Chen, Wenyu, Shan, Shimin
Gibbs sampling is one of the most commonly used Markov Chain Monte Carlo (MCMC) algorithms due to its simplicity and efficiency. It cycles through the latent variables, sampling each one from its distribution conditional on the current values of all the other variables. Conventional Gibbs sampling is based on the systematic scan (with a deterministic order of variables). In contrast, in recent years, Gibbs sampling with random scan has shown its advantage in some scenarios. However, almost all the analyses of Gibbs sampling with the random scan are based on uniform selection of variables. In this paper, we focus on a random scan Gibbs sampling method that selects each latent variable non-uniformly. Firstly, we show that this non-uniform scan Gibbs sampling leaves the target posterior distribution invariant. Then we explore how to determine the selection probability for latent variables. In particular, we construct an objective as a function of the selection probability and solve the constrained optimization problem. We further derive an analytic solution of the selection probability, which can be estimated easily. Our algorithm relies on the simple intuition that choosing the variable updates according to their marginal probabilities enhances the mixing time of the Markov chain. Finally, we validate the effectiveness of the proposed Gibbs sampler by conducting a set of experiments on real-world applications.
Diffusion-based Episodes Augmentation for Offline Multi-Agent Reinforcement Learning
Oh, Jihwan, Kim, Sungnyun, Kim, Gahee, Kim, Sunghwan, Yun, Se-Young
Offline multi-agent reinforcement learning (MARL) is increasingly recognized as crucial for effectively deploying RL algorithms in environments where real-time interaction is impractical, risky, or costly. In the offline setting, learning from a static dataset of past interactions allows for the development of robust and safe policies without the need for live data collection, which can be fraught with challenges. Building on this foundational importance, we present EAQ, Episodes Augmentation guided by Q-total loss, a novel approach for offline MARL framework utilizing diffusion models. EAQ integrates the Q-total function directly into the diffusion model as a guidance to maximize the global returns in an episode, eliminating the need for separate training. Our focus primarily lies on cooperative scenarios, where agents are required to act collectively towards achieving a shared goal-essentially, maximizing global returns. Consequently, we demonstrate that our episodes augmentation in a collaborative manner significantly boosts offline MARL algorithm compared to the original dataset, improving the normalized return by +17.3% and +12.9% for medium and poor behavioral policies in SMAC simulator, respectively.
Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models
Haider, Adnan, Na, Xingyu, McDermott, Erik, Ng, Tim, Huang, Zhen, Zhuang, Xiaodan
This paper introduces a novel training framework called Focused Discriminative Training (FDT) to further improve streaming word-piece end-to-end (E2E) automatic speech recognition (ASR) models trained using either CTC or an interpolation of CTC and attention-based encoder-decoder (AED) loss. The proposed approach presents a novel framework to identify and improve a model's recognition on challenging segments of an audio. Notably, this training framework is independent of hidden Markov models (HMMs) and lattices, eliminating the need for substantial decision-making regarding HMM topology, lexicon, and graph generation, as typically required in standard discriminative training approaches. Compared to additional fine-tuning with MMI or MWER loss on the encoder, FDT is shown to be more effective in achieving greater reductions in Word Error Rate (WER) on streaming models trained on LibriSpeech. Additionally, this method is shown to be effective in further improving a converged word-piece streaming E2E model trained on 600k hours of assistant and dictation dataset.
Optimally Solving Simultaneous-Move Dec-POMDPs: The Sequential Central Planning Approach
Peralez, Johan, Delage, Aurélien, Castellini, Jacopo, Cunha, Rafael F., Dibangoye, Jilles S.
Centralized training for decentralized execution paradigm emerged as the state-of-the-art approach to epsilon-optimally solving decentralized partially observable Markov decision processes. However, scalability remains a significant issue. This paper presents a novel and more scalable alternative, namely sequential-move centralized training for decentralized execution. This paradigm further pushes the applicability of Bellman's principle of optimality, raising three new properties. First, it allows a central planner to reason upon sufficient sequential-move statistics instead of prior simultaneous-move ones. Next, it proves that epsilon-optimal value functions are piecewise linear and convex in sufficient sequential-move statistics. Finally, it drops the complexity of the backup operators from double exponential to polynomial at the expense of longer planning horizons. Besides, it makes it easy to use single-agent methods, e.g., SARSA algorithm enhanced with these findings applies while still preserving convergence guarantees. Experiments on two- as well as many-agent domains from the literature against epsilon-optimal simultaneous-move solvers confirm the superiority of the novel approach. This paradigm opens the door for efficient planning and reinforcement learning methods for multi-agent systems.