AITopics | offline rl algorithm

Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management Anonymous Author(s) Affiliation Address email

Neural Information Processing SystemsApr-25-2026, 02:24:44 GMT

Reinforcement learning (RL) has shown great promise for developing dialogue1 management (DM) agents that are non-myopic, conduct rich conversations, and2 maximize overall user satisfaction. Despite recent developments in RL and lan-3 guage models (LMs), using RL to power conversational chatbots remains challeng-4 ing, in part because RL requires online exploration to learn effectively, whereas5 collecting novel human-bot interactions can be expensive and unsafe. This issue is6 exacerbated by the combinatorial action spaces facing these algorithms, as most7 LM agents generate responses at the word level. We develop a variety of RL algo-8 rithms, specialized to dialogue planning, that leverage recent Mixture-of-Expert9 Language Models (MoE-LMs)--models that capture diverse semantics, generate10 utterances reflecting different intents, and are amenable for multi-turn DM. By11 exploiting MoE-LM structure, our methods significantly reduce the size of the12 action space and improve the efficacy of RL-based DM.

Add feedback

12bcf58a1c09a0fcb5310f3589291ab4-Paper-Conference.pdf

Neural Information Processing SystemsApr-25-2026, 02:24:41 GMT

machine learning, natural language, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country: Europe (0.46)

Genre:

Research Report > New Finding (1.00)
Personal (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Communications > Social Media (0.95)
(3 more...)

Add feedback

Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets

Neural Information Processing SystemsApr-24-2026, 22:26:44 GMT

Offline policy learning is aimed at learning decision-making policies using existing datasets of trajectories without collecting additional data. The primary motivation for using reinforcement learning (RL) instead of supervised learning techniques such as behavior cloning is to find a policy that achieves a higher average return than the trajectories constituting the dataset. However, we empirically find that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset. We argue this is due to an assumption made by current offline RL algorithms of staying close to the trajectories in the dataset. If the dataset primarily consists of sub-optimal trajectories, this assumption forces the policy to mimic the suboptimal actions. We overcome this issue by proposing a sampling strategy that enables the policy to only be constrained to "good data" rather than all actions in the dataset (i.e., uniform sampling). We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms. Our evaluation demonstrates significant performance gains in 72 imbalanced datasets, D4RL dataset, and across three different offline RL algorithms.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.92)

Genre: Research Report > New Finding (1.00)

Industry: Government (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

NetworkGym: Reinforcement Learning Environments for Multi-Access Traffic Management in Network Simulation

Neural Information Processing SystemsMar-22-2026, 08:27:12 GMT

Mobile devices such as smartphones, laptops, and tablets can often connect to multiple access networks (e.g., Wi-Fi, LTE, and 5G) simultaneously.Recent advancements facilitate seamless integration of these connections below the transport layer, enhancing the experience for apps that lack inherent multi-path support.This optimization hinges on dynamically determining the traffic distribution across networks for each device, a process referred to as multi-access traffic splitting.This paper introduces NetworkGym, a high-fidelity network environment simulator that facilitates generating multiple network traffic flows and multi-access traffic splitting.This simulator facilitates training and evaluating different RL-based solutions for the multi-access traffic splitting problem.Our initial explorations demonstrate that the majority of existing state-of-the-art offline RL algorithms (e.g. CQL) fail to outperform certain hand-crafted heuristic policies on average.This illustrates the urgent need to evaluate offline RL algorithms against a broader range of benchmarks, rather than relying solely on popular ones such as D4RL.We also propose an extension to the TD3+BC algorithm, named Pessimistic TD3 (PTD3), and demonstrate that it outperforms many state-of-the-art offline RL algorithms.PTD3's behavioral constraint mechanism, which relies on value-function pessimism, is theoretically motivated and relatively simple to implement.We open source our code and offline datasets at github.com/hmomin/networkgym.

artificial intelligence, machine learning, reinforcement learning, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)

Add feedback

Look Beneath the Surface: Exploiting Fundamental Symmetry for Sample-Efficient Offline RL

Neural Information Processing SystemsMar-18-2026, 13:23:23 GMT

Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms heavily depends on the scale and state-action space coverage of datasets. Real-world data collection is often expensive and uncontrollable, leading to small and narrowly covered datasets and posing significant challenges for practical deployments of offline RL. In this paper, we provide a new insight that leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance under small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for OOD samples based on compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. Based on extensive experiments, we find TSRL achieves great performance on small benchmark datasets with as few as 1% of the original samples, which significantly outperforms the recent offline RL algorithms in terms of data efficiency and generalizability.

artificial intelligence, machine learning, reinforcement learning, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

df54302388bbc145aacaa1a54a4a5933-Paper-Conference.pdf

Neural Information Processing SystemsFeb-18-2026, 10:22:16 GMT

artificial intelligence, machine learning, reinforcement learning, (12 more...)

Neural Information Processing Systems

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.99)

Add feedback

Offline Behavior Distillation

Neural Information Processing SystemsFeb-18-2026, 06:01:08 GMT

Inspired by dataset distillation (DD) [Wang et al., 2018, Zhao et al., (Corollary 1). Extensive experiments on nine datasets of D4RL benchmark [Fu et al., 2020] with multiple environments and data qualities illustrate that our Av-PBC remarkably promotes the OBD performance, Moreover, Av-PBC has a significant convergence speed and requires only a quarter of distillation steps compared to DBC and PBC.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(4 more...)

Add feedback

c112271dab20d38b43a1c0669f86b4f6-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-17-2026, 22:40:56 GMT

artificial intelligence, machine learning, reinforcement learning, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology (0.93)
Telecommunications (0.68)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Robots (0.67)

Add feedback

Survival Instinct in Offline Reinforcement Learning

Neural Information Processing SystemsFeb-16-2026, 23:26:08 GMT

This phenomenon cannot be easily explained by offline RL's return maximization objective.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > Portugal > Porto > Porto (0.04)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

GT A: Generative Trajectory Augmentation with Guidance for Offline Reinforcement Learning

Neural Information Processing SystemsFeb-15-2026, 13:21:34 GMT

Offline Reinforcement Learning (Offline RL) presents challenges of learning effective decision-making policies from static datasets without any online interactions.

machine learning, reinforcement learning, trajectory, (17 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Filters

Collaborating Authors

offline rl algorithm

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management Anonymous Author(s) Affiliation Address email

12bcf58a1c09a0fcb5310f3589291ab4-Paper-Conference.pdf

Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets

NetworkGym: Reinforcement Learning Environments for Multi-Access Traffic Management in Network Simulation

Look Beneath the Surface: Exploiting Fundamental Symmetry for Sample-Efficient Offline RL

df54302388bbc145aacaa1a54a4a5933-Paper-Conference.pdf

Offline Behavior Distillation

c112271dab20d38b43a1c0669f86b4f6-Paper-Datasets_and_Benchmarks_Track.pdf

Survival Instinct in Offline Reinforcement Learning

GT A: Generative Trajectory Augmentation with Guidance for Offline Reinforcement Learning