offline rl algorithm
Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management
Reinforcement learning (RL) has shown great promise for developing dialogue management (DM) agents that are non-myopic, conduct rich conversations, and maximize overall user satisfaction. Despite recent developments in RL and language models (LMs), using RL to power conversational chatbots remains challenging, in part because RL requires online exploration to learn effectively, whereas collecting novel human-bot interactions can be expensive and unsafe. This issue is exacerbated by the combinatorial action spaces facing these algorithms, as most LM agents generate responses at the word level. We develop a variety of RL algorithms, specialized to dialogue planning, that leverage recent Mixture-of-Expert Language Models (MoE-LMs): models that capture diverse semantics, generate utterances reflecting different intents, and are amenable to multi-turn DM. By exploiting MoE-LM structure, our methods significantly reduce the size of the action space and improve the efficacy of RL-based DM.
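The action-space reduction described above can be illustrated with a toy sketch. It assumes a hypothetical setup in which each expert proposes one candidate utterance and a scoring function ranks them, so the manager chooses among a handful of candidates instead of composing a response token by token. The expert names, candidate strings, and the `toy_q` scorer are placeholders for illustration only, not components of the paper's method.

```python
from typing import Callable, Dict


def word_level_action_count(vocab_size: int, max_len: int) -> int:
    """Number of distinct token sequences of exactly max_len tokens."""
    return vocab_size ** max_len


def select_utterance(
    candidates: Dict[str, str],
    q_value: Callable[[str, str], float],
    state: str,
) -> str:
    """Return the name of the expert whose candidate scores highest."""
    return max(candidates, key=lambda name: q_value(state, candidates[name]))


# Hypothetical experts, one candidate utterance each.
candidates = {
    "empathetic": "That sounds hard.",
    "inquisitive": "What happened next?",
    "neutral": "I see.",
}


def toy_q(state: str, utterance: str) -> float:
    """Placeholder scorer (assumption): longer utterances score higher."""
    return float(len(utterance))


if __name__ == "__main__":
    # Word-level decisions grow exponentially with response length,
    # while expert-level decisions stay at len(candidates).
    print(word_level_action_count(vocab_size=10, max_len=3))
    print(len(candidates))
    print(select_utterance(candidates, toy_q, state="user: my flight was cancelled"))
```

In a real system the scorer would be a learned value function over conversation states; the toy length-based scorer only makes the selection step concrete.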
Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets
Offline policy learning aims to learn decision-making policies from existing datasets of trajectories without collecting additional data. The primary motivation for using reinforcement learning (RL) instead of supervised learning techniques such as behavior cloning is to find a policy that achieves a higher average return than the trajectories constituting the dataset. However, we empirically find that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset. We argue this is due to an assumption made by current offline RL algorithms of staying close to the trajectories in the dataset. If the dataset primarily consists of suboptimal trajectories, this assumption forces the policy to mimic the suboptimal actions. We overcome this issue by proposing a sampling strategy that constrains the policy only to "good data" rather than to all actions in the dataset (i.e., uniform sampling). We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms. Our evaluation demonstrates significant performance gains across 72 imbalanced datasets, the D4RL datasets, and three different offline RL algorithms.
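To make the contrast with uniform sampling concrete, here is a minimal sketch of one simple non-uniform scheme: sampling trajectories with probability given by a softmax over their normalized returns. This is an assumed stand-in for illustration, not the realization proposed in the paper; the function names and the `temperature` parameter are hypothetical.

```python
import math
import random
from typing import List, Sequence


def return_weights(returns: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over min-max normalized trajectory returns."""
    lo, hi = min(returns), max(returns)
    span = (hi - lo) or 1.0  # avoid division by zero when all returns are equal
    logits = [(r - lo) / span / temperature for r in returns]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory_indices(
    returns: Sequence[float],
    batch_size: int,
    temperature: float,
    rng: random.Random,
) -> List[int]:
    """Draw trajectory indices with probability proportional to return_weights."""
    weights = return_weights(returns, temperature)
    return rng.choices(range(len(returns)), weights=weights, k=batch_size)


if __name__ == "__main__":
    # Imbalanced toy dataset: nine low-return trajectories, one high-return one.
    returns = [1.0] * 9 + [10.0]
    rng = random.Random(0)
    batch = sample_trajectory_indices(returns, batch_size=100, temperature=0.1, rng=rng)
    # Under uniform sampling the last trajectory would appear about 10% of the time.
    print(batch.count(9) / len(batch))
```

A lower temperature concentrates sampling on the highest-return trajectories, while a very high temperature approaches uniform sampling.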
NetworkGym: Reinforcement Learning Environments for Multi-Access Traffic Management in Network Simulation
Mobile devices such as smartphones, laptops, and tablets can often connect to multiple access networks (e.g., Wi-Fi, LTE, and 5G) simultaneously. Recent advancements facilitate seamless integration of these connections below the transport layer, enhancing the experience for apps that lack inherent multi-path support. This optimization hinges on dynamically determining the traffic distribution across networks for each device, a process referred to as multi-access traffic splitting. This paper introduces NetworkGym, a high-fidelity network environment simulator that facilitates generating multiple network traffic flows and multi-access traffic splitting. This simulator facilitates training and evaluating different RL-based solutions for the multi-access traffic splitting problem. Our initial explorations demonstrate that the majority of existing state-of-the-art offline RL algorithms (e.g., CQL) fail to outperform certain hand-crafted heuristic policies on average. This illustrates the urgent need to evaluate offline RL algorithms against a broader range of benchmarks, rather than relying solely on popular ones such as D4RL. We also propose an extension to the TD3+BC algorithm, named Pessimistic TD3 (PTD3), and demonstrate that it outperforms many state-of-the-art offline RL algorithms. PTD3's behavioral constraint mechanism, which relies on value-function pessimism, is theoretically motivated and relatively simple to implement. We open source our code and offline datasets at github.com/hmomin/networkgym.
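For context on the baseline that PTD3 extends, the sketch below computes the actor objective of standard TD3+BC on a batch: a Q-value term scaled by a normalizing factor, plus a behavior-cloning penalty. It is written in plain Python for readability and shows only the TD3+BC baseline; PTD3's pessimism-based constraint is not specified in this abstract and is not sketched here. The function and argument names are illustrative, and normalizing by the mean absolute Q-value of the batch is one common variant.

```python
from typing import Sequence


def td3_bc_actor_loss(
    q_values: Sequence[float],
    policy_actions: Sequence[Sequence[float]],
    dataset_actions: Sequence[Sequence[float]],
    alpha: float = 2.5,
) -> float:
    """TD3+BC actor loss: -lambda * mean(Q) + mean squared action error.

    q_values[i] is the critic's value for (state_i, policy_actions[i]).
    lambda = alpha / mean(|Q|) keeps the two terms on a comparable scale.
    """
    n = len(q_values)
    mean_abs_q = sum(abs(q) for q in q_values) / n
    lam = alpha / max(mean_abs_q, 1e-8)  # guard against division by zero
    q_term = sum(q_values) / n

    # Mean squared error over every action dimension in the batch.
    squared_errors = [
        (p - d) ** 2
        for pa, da in zip(policy_actions, dataset_actions)
        for p, d in zip(pa, da)
    ]
    bc_term = sum(squared_errors) / len(squared_errors)
    return -lam * q_term + bc_term


if __name__ == "__main__":
    loss = td3_bc_actor_loss(
        q_values=[2.0, 2.0],
        policy_actions=[[0.0], [1.0]],
        dataset_actions=[[0.0], [0.0]],
    )
    print(loss)
```

Minimizing this loss pushes the policy toward high-value actions while the squared-error term keeps it near the actions recorded in the dataset.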
Look Beneath the Surface: Exploiting Fundamental Symmetry for Sample-Efficient Offline RL
Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms heavily depends on the scale and state-action space coverage of datasets. Real-world data collection is often expensive and uncontrollable, leading to small and narrowly covered datasets and posing significant challenges for practical deployments of offline RL. In this paper, we provide a new insight: leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance under small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for out-of-distribution (OOD) samples based on compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. Based on extensive experiments, we find that TSRL achieves strong performance on small benchmark datasets with as few as 1% of the original samples, significantly outperforming recent offline RL algorithms in terms of data efficiency and generalizability.
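The idea of checking consistency between forward and reverse dynamics can be sketched on a toy system. The example assumes one plausible form of the check: a forward model estimates the state derivative from (s, a), a reverse model estimates its negation from (s', a), and the squared norm of their sum is the residual. TDM works in a learned latent space with learned models; the hand-written models, raw states, and function names below are simplifying assumptions for illustration only.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float], Sequence[float]], List[float]]


def t_symmetry_residual(
    forward_model: Model,
    reverse_model: Model,
    state: Sequence[float],
    action: Sequence[float],
    next_state: Sequence[float],
) -> float:
    """Squared norm of (forward estimate + reverse estimate).

    forward_model(state, action) estimates ds/dt.
    reverse_model(next_state, action) estimates -ds/dt.
    A consistent pair sums to zero, so a larger residual suggests the
    sample is less reliable under this check.
    """
    f = forward_model(state, action)
    g = reverse_model(next_state, action)
    return sum((fi + gi) ** 2 for fi, gi in zip(f, g))


# Toy integrator dynamics (assumption): ds/dt = a.
def forward_model(state: Sequence[float], action: Sequence[float]) -> List[float]:
    return list(action)


def reverse_model(next_state: Sequence[float], action: Sequence[float]) -> List[float]:
    return [-x for x in action]


def biased_reverse_model(next_state: Sequence[float], action: Sequence[float]) -> List[float]:
    """A deliberately inconsistent reverse model, for comparison."""
    return [-x + 0.3 for x in action]


if __name__ == "__main__":
    s, a, s_next = [0.0, 0.0], [0.5, -1.0], [0.05, -0.1]
    print(t_symmetry_residual(forward_model, reverse_model, s, a, s_next))
    print(t_symmetry_residual(forward_model, biased_reverse_model, s, a, s_next))
```

A residual of this kind could serve either as a training penalty that ties the two models together or as a filter on candidate samples, which is the role the abstract attributes to T-symmetry compliance.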
Offline Behavior Distillation
Inspired by dataset distillation (DD) [Wang et al., 2018, Zhao et al., ... (Corollary 1). Extensive experiments on nine datasets of the D4RL benchmark [Fu et al., 2020], covering multiple environments and data qualities, show that our Av-PBC substantially improves OBD performance. Moreover, Av-PBC converges considerably faster, requiring only a quarter of the distillation steps needed by DBC and PBC.