Efficient Recurrent Off-Policy RL Requires a Context-Encoder-Specific Learning Rate

Mar-20-2026, 17:00:51 GMT–Neural Information Processing Systems

Real-world decision-making tasks are usually partially observable Markov decision processes (POMDPs), where the state is not fully observable. Recent work has shown that recurrent reinforcement learning (RL), which pairs a context encoder based on recurrent neural networks (RNNs) for predicting the unobservable state with a multilayer perceptron (MLP) policy for decision making, can mitigate partial observability and serve as a robust baseline for POMDP tasks. However, prior recurrent RL algorithms have suffered from training instability. In this paper, we find that this instability stems from the autoregressive nature of RNNs, which causes even small changes in RNN parameters to produce large output variations over long trajectories.
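The amplification described in the abstract can be illustrated with a toy example. This is a minimal sketch, not the paper's model: a scalar linear recurrence stands in for the RNN context encoder, and the names `rollout` and `output_gap`, the weight 0.99, and the perturbation size are all assumptions chosen for illustration.

```python
def rollout(w, inputs, h0=0.0):
    """Run the toy recurrence h_{t+1} = w * h_t + x_t and return all hidden states."""
    h = h0
    states = []
    for x in inputs:
        h = w * h + x
        states.append(h)
    return states


def output_gap(w, delta, length):
    """Absolute change in the final hidden state when w is perturbed by delta."""
    inputs = [1.0] * length  # constant input, an assumption for simplicity
    base = rollout(w, inputs)
    perturbed = rollout(w + delta, inputs)
    return abs(perturbed[-1] - base[-1])


delta = 1e-3  # small parameter change, e.g. the result of one gradient step
gaps = {T: output_gap(0.99, delta, T) for T in (1, 10, 100, 1000)}

for T, gap in gaps.items():
    print(f"trajectory length {T:4d}: output gap {gap:.6f}")

# The same recurrence with a ten-times-smaller parameter change.
small_step_gap = output_gap(0.99, delta / 10, 1000)
print(f"length 1000 with delta/10: output gap {small_step_gap:.6f}")
```

In this toy setting the same parameter change produces a larger output gap as the trajectory grows, because the perturbed weight is reapplied at every step. Shrinking `delta`, which is roughly what a smaller learning rate for the encoder alone would do to each update, reduces the gap without touching the step size of a non-recurrent component.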

artificial intelligence, machine learning, proceedings
