sample efficient reinforcement learning
Review for NeurIPS paper: Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation
Weaknesses: Below I list my concerns regarding the setup and the reported results. In the finite case, devising an algorithm for the online setting posed more serious challenges than the generative setting. Restricting the results to the generative setting hides the price that must be paid for the need to navigate the MDP. Could you at least elaborate on the potential difficulties and challenges involved in extending the results to the online case? Could one hope for a similar gain in sample complexity (over structure-oblivious algorithms)?
Review for NeurIPS paper: Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation
The contributions were unanimously appreciated: the paper introduces an interesting structure, and the regret analysis, including the low-rank matrix estimation part, is interesting. We recommend the paper for acceptance and encourage the authors to account for the reviewers' comments when preparing the camera-ready version of the paper.
PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning
In Reinforcement Learning (RL), enhancing sample efficiency is crucial, particularly in scenarios where data acquisition is costly or risky. In principle, off-policy RL algorithms can improve sample efficiency by allowing multiple updates per environment interaction. However, these repeated updates often cause the model to overfit to earlier interactions, a phenomenon referred to as loss of plasticity. Our study investigates the underlying causes of this phenomenon by dividing plasticity into two aspects: input plasticity and label plasticity. Synthetic experiments on the CIFAR-10 dataset reveal that finding smoother minima of the loss landscape enhances input plasticity, whereas refined gradient propagation improves label plasticity.
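One standard technique for steering optimization toward smoother minima is Sharpness-Aware Minimization (SAM), which perturbs the weights toward higher loss before taking the descent step. The sketch below is illustrative only, using a toy 1-D quadratic rather than the paper's actual training setup; the function names are hypothetical.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization (SAM) step: perturb the weights
    toward higher loss, then descend using the gradient taken at the
    perturbed point, which implicitly penalizes sharp minima."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # worst-case perturbation
    return w - lr * grad_fn(w + eps)

# Toy demo on the 1-D quadratic loss f(w) = w**2, whose gradient is 2w.
grad_fn = lambda w: 2.0 * np.asarray(w)
w = np.array([1.0])
for _ in range(50):
    w = sam_step(w, grad_fn)
print(abs(w[0]))  # settles close to the minimum at 0
```

In a real agent, `grad_fn` would be the minibatch gradient of the RL loss; the two-gradient structure (one at `w + eps`, one step from `w`) is what distinguishes SAM from plain gradient descent.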
Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation
We consider the question of learning the Q-function in a sample-efficient manner for reinforcement learning with continuous state and action spaces under a generative model. If the Q-function is Lipschitz continuous, then the minimal sample complexity for estimating an \epsilon-optimal Q-function is known to scale as \Omega(\frac{1}{\epsilon^{d_1+d_2+2}}) per classical non-parametric learning theory, where d_1 and d_2 denote the dimensions of the state and action spaces, respectively. The Q-function, when viewed as a kernel, induces a Hilbert-Schmidt operator and hence possesses a square-summable spectrum. This motivates us to consider a parametric class of Q-functions parameterized by their "rank" r, which contains all Lipschitz Q-functions as r \to \infty. As our key contribution, we develop a simple, iterative learning algorithm that finds an \epsilon-optimal Q-function with sample complexity of \widetilde{O}(\frac{1}{\epsilon^{\max(d_1, d_2)+2}}) when the optimal Q-function has low rank r and the discounting factor \gamma is below a certain threshold.
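The workhorse of low-rank matrix estimation is projection onto the set of rank-r matrices via truncated SVD. The following is a minimal sketch, not the paper's algorithm: it assumes a Q-function evaluated on a discretized state-action grid with additive noise and shows that the rank-r projection denoises the estimate.

```python
import numpy as np

def low_rank_estimate(Q_noisy, r):
    """Project a noisy Q-value matrix onto the set of rank-r matrices
    via truncated SVD."""
    U, s, Vt = np.linalg.svd(Q_noisy, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

# Example: a true rank-2 Q-matrix over a 50 x 40 state/action grid,
# observed with additive Gaussian noise.
rng = np.random.default_rng(0)
Q_true = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
Q_noisy = Q_true + 0.1 * rng.standard_normal(Q_true.shape)

Q_hat = low_rank_estimate(Q_noisy, r=2)
err_raw = np.linalg.norm(Q_noisy - Q_true)
err_hat = np.linalg.norm(Q_hat - Q_true)
print(err_hat < err_raw)  # the rank-2 projection removes most of the noise
```

The intuition behind the improved \widetilde{O}(1/\epsilon^{\max(d_1,d_2)+2}) rate is visible here: a rank-r matrix is determined by far fewer degrees of freedom than its full d_1 \times d_2 grid of entries.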
Sample Efficient Reinforcement Learning in Mixed Systems through Augmented Samples and Its Applications to Queueing Networks
This paper considers a class of reinforcement learning problems involving systems with two types of states: stochastic and pseudo-stochastic. In such systems, stochastic states follow a stochastic transition kernel, while the transitions of pseudo-stochastic states are deterministic {\em given} the stochastic states/transitions. We refer to such systems as mixed systems, which are widely used in various applications, including manufacturing systems, communication networks, and queueing networks. We propose a sample-efficient RL method that accelerates learning by generating augmented data samples. The proposed algorithm is data-driven (model-free), but it learns the policy from both real and augmented samples.
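The augmentation idea can be sketched on a single-server queue, where the only stochastic component is the arrival and the queue length evolves deterministically given it. This is an illustrative toy model, not the paper's algorithm: each observed arrival is replayed at many hypothetical queue lengths, multiplying one real sample into many augmented transitions.

```python
import random

def queue_step(q, arrival, service=1):
    """Deterministic transition of the pseudo-stochastic state (queue
    length) given the stochastic component (arrival)."""
    return max(q + arrival - service, 0)

def augment(observed_arrival, queue_lengths):
    """One real stochastic observation yields one augmented transition
    per hypothetical queue length, since the rest is deterministic."""
    return [(q, observed_arrival, queue_step(q, observed_arrival))
            for q in queue_lengths]

random.seed(0)
real_samples, augmented = [], []
q = 0
for _ in range(5):
    arrival = random.randint(0, 2)   # the only stochastic part
    real_samples.append((q, arrival, queue_step(q, arrival)))
    # reuse the sampled arrival at every queue length of interest
    augmented.extend(augment(arrival, range(10)))
    q = queue_step(q, arrival)

print(len(real_samples), len(augmented))  # 5 real -> 50 augmented
```

The augmented transitions are exactly consistent with the true dynamics because the deterministic part is known; only the stochastic component needs to be sampled from the real system.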
AI Planning Annotation for Sample Efficient Reinforcement Learning
The RL environment maintains the states of the grounded logistics planning domain. In the experiment, we define the planning task as an abstract planning task over a subset of the predicates and actions in the grounded logistics planning task. The state mapping function is therefore a projection of logical variables from an RL state to its planning state. The PDDL domain file used for the RL task is described as follows.
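A projection-style state mapping can be sketched as a filter over grounded literals. The predicate names below are hypothetical placeholders for illustration; the actual logistics domain defines its own predicates in the PDDL file.

```python
# Hypothetical abstract predicates kept by the projection; the real
# logistics domain uses its own predicate set.
ABSTRACT_PREDICATES = {"at", "in"}

def state_mapping(rl_state):
    """Project an RL state (a set of grounded literals, each a tuple of
    predicate name and arguments) onto the abstract planning state by
    keeping only literals over the abstract predicates."""
    return {lit for lit in rl_state if lit[0] in ABSTRACT_PREDICATES}

rl_state = {
    ("at", "truck1", "loc1"),
    ("in", "pkg1", "truck1"),
    ("fuel", "truck1", "low"),   # dropped: not an abstract predicate
}
planning_state = state_mapping(rl_state)
print(planning_state)
```

Because the mapping is a pure projection, every RL state has a well-defined planning state, which is what lets the planner's annotations guide the RL agent.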
Sample Efficient Reinforcement Learning through Learning from Demonstrations in Minecraft
Scheller, Christian, Schraner, Yanick, Vogel, Manfred
Sample inefficiency of deep reinforcement learning methods is a major obstacle to their use in real-world applications. In this work, we show how human demonstrations can improve the final performance of agents on the Minecraft minigame ObtainDiamond with only 8M frames of environment interaction. We propose a training procedure in which policy networks are first trained on human data and later fine-tuned by reinforcement learning. Using a policy exploitation mechanism, experience replay, and an additional loss against catastrophic forgetting, our best agent achieved a mean score of 48. Our proposed solution placed 3rd in the NeurIPS MineRL Competition for Sample-Efficient Reinforcement Learning.
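The pretrain-then-fine-tune structure can be sketched with a tabular softmax policy: a behaviour cloning phase fits the policy to demonstrated (state, action) pairs with a cross-entropy gradient, then a REINFORCE phase updates it from reward. This is a minimal toy sketch, not the paper's network-based procedure, and the demo data and reward are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 3
logits = np.zeros((n_states, n_actions))   # tabular softmax policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Phase 1: behaviour cloning -- the cross-entropy gradient pushes
# probability mass onto the demonstrated action in each state.
demos = [(0, 2), (1, 0), (2, 1), (3, 2)] * 50   # (state, action) pairs
for s, a in demos:
    grad = -softmax(logits[s])
    grad[a] += 1.0                 # d log pi(a|s) / d logits
    logits[s] += 0.5 * grad

# Phase 2: RL fine-tuning -- REINFORCE updates from environment reward,
# starting from the cloned policy instead of from scratch.
def reward(s, a):                  # toy reward agreeing with the demos
    return 1.0 if (s, a) in {(0, 2), (1, 0), (2, 1), (3, 2)} else 0.0

for _ in range(200):
    s = rng.integers(n_states)
    p = softmax(logits[s])
    a = rng.choice(n_actions, p=p)
    g = -p
    g[a] += 1.0
    logits[s] += 0.1 * reward(s, a) * g

greedy = logits.argmax(axis=1)
print(greedy)   # greedy policy recovers the demonstrated actions
```

Starting the RL phase from the cloned policy is what buys the sample efficiency: exploration begins near demonstrated behaviour rather than from a uniform policy.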
Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-like Exploration
We present an implementation of model-based online reinforcement learning (RL) for continuous domains with deterministic transitions, specifically designed to achieve low sample complexity. Since the environment is unknown, an agent must intelligently balance exploration and exploitation and must be able to rapidly generalize from observations. While a number of related sample-efficient RL algorithms have been proposed in the past, they mainly considered model learners with weak generalization capabilities, in order to allow theoretical analysis. Here, we separate function approximation in the model learner (which does require samples) from interpolation in the planner (which does not). For model learning we apply Gaussian process regression (GP), which automatically adjusts itself to the complexity of the problem (via Bayesian hyperparameter selection) and, in practice, is often able to learn a highly accurate model from very little data. In addition, a GP provides a natural way to quantify the uncertainty of its predictions, which allows us to implement the "optimism in the face of uncertainty" principle used to efficiently control exploration. Our method is evaluated on four common benchmark domains.
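The role of GP uncertainty in "optimism in the face of uncertainty" can be sketched with a from-scratch RBF-kernel GP posterior: points far from the training data get a large posterior standard deviation, so an upper-confidence-bound score favours them for exploration. This is a generic illustration under a fixed length scale, not the paper's implementation with Bayesian hyperparameter selection.

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    """RBF (squared-exponential) kernel matrix between two point sets."""
    d = X1[:, None, :] - X2[None, :, :]
    return np.exp(-0.5 * np.sum(d**2, axis=-1) / ls**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-4, ls=1.0):
    """Exact GP posterior mean and standard deviation at X_test."""
    K = rbf(X_train, X_train, ls) + noise * np.eye(len(X_train))
    Ks = rbf(X_test, X_train, ls)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(rbf(X_test, X_test, ls)) - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# One test point near the data, one far away in unexplored territory.
X_train = np.array([[0.0], [1.0]])
y_train = np.array([0.0, 1.0])
X_test = np.array([[0.5], [5.0]])
mean, std = gp_posterior(X_train, y_train, X_test)

# Optimism in the face of uncertainty: score by an upper confidence bound.
ucb = mean + 2.0 * std
print(std[1] > std[0])  # the unexplored point has larger uncertainty
```

The far-away point wins the UCB comparison even though its posterior mean reverts toward the prior, which is exactly how uncertainty-driven exploration steers the agent toward unvisited regions.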