


Supplementary material: Enhanced Meta Reinforcement Learning using Demonstrations in Sparse Reward Environments

Neural Information Processing Systems

We use the well-known Performance Difference Lemma [16] in our analysis. We can obtain a performance difference lemma for the meta-policies as follows. Here, (a) follows from Assumption 3.1, from which we have P ...

In this section, we describe all the simulation and real-world environments in detail.

B.1 Simulation Environments

Point 2D Navigation: Point 2D Navigation [9] is a two-dimensional goal-reaching environment with $S \subset \mathbb{R}^2$, $A \subset \mathbb{R}^2$, and the following dynamics:

$x_{t+1} = x_t + dx_t, \quad y_{t+1} = y_t + dy_t, \quad \text{such that } dx_t^2 + dy_t^2 \le 0.1^2,$

where $x_t$ and $y_t$ are the x and y locations of the agent, and $dx_t$ and $dy_t$ are the actions, corresponding to the displacement in the x and y directions respectively, all taken at time step $t$. The goals are located on a semicircle of radius 2, and the episode terminates when the agent reaches the goal or spends more than 100 time steps in the environment.
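The Point 2D Navigation dynamics described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class name `Point2DNavigation`, the start position at the origin, the goal tolerance `GOAL_TOL`, the choice of the upper semicircle, and the rescaling of oversized actions onto the radius-0.1 disk are all assumptions made for illustration. Only the update rule, the step-size bound, the goal radius, and the 100-step limit come from the text.

```python
import math
import random


class Point2DNavigation:
    """Hypothetical sketch of the 2D goal-reaching dynamics; not the original code."""

    MAX_STEP = 0.1      # from the text: dx_t^2 + dy_t^2 <= 0.1^2
    GOAL_RADIUS = 2.0   # from the text: goals lie on a semicircle of radius 2
    HORIZON = 100       # from the text: 100-step limit (read here as "stop at step 100")
    GOAL_TOL = 0.1      # ASSUMPTION: distance at which the goal counts as reached

    def __init__(self, goal_angle=None, seed=None):
        rng = random.Random(seed)
        # ASSUMPTION: goals are sampled on the upper semicircle, angle in [0, pi].
        angle = rng.uniform(0.0, math.pi) if goal_angle is None else goal_angle
        self.goal = (self.GOAL_RADIUS * math.cos(angle),
                     self.GOAL_RADIUS * math.sin(angle))
        self.reset()

    def reset(self):
        # ASSUMPTION: the agent starts at the origin.
        self.x, self.y = 0.0, 0.0
        self.t = 0
        return (self.x, self.y)

    def step(self, dx, dy):
        # Enforce the action constraint by rescaling onto the disk of radius 0.1.
        # ASSUMPTION: the text states the constraint but not how it is enforced.
        norm = math.hypot(dx, dy)
        if norm > self.MAX_STEP:
            scale = self.MAX_STEP / norm
            dx, dy = dx * scale, dy * scale

        # Dynamics from the text: x_{t+1} = x_t + dx_t, y_{t+1} = y_t + dy_t.
        self.x += dx
        self.y += dy
        self.t += 1

        dist = math.hypot(self.x - self.goal[0], self.y - self.goal[1])
        reached = dist <= self.GOAL_TOL
        done = reached or self.t >= self.HORIZON
        return (self.x, self.y), reached, done


# Usage: walk straight toward a goal placed at the top of the semicircle.
env = Point2DNavigation(goal_angle=math.pi / 2)
state = env.reset()
done = False
while not done:
    gx, gy = env.goal
    state, reached, done = env.step(gx - state[0], gy - state[1])
print(env.t, reached)
```

Because each step covers at most 0.1 units and the goal is 2 units from the assumed start, a policy that heads straight for the goal needs roughly 20 steps, well inside the 100-step limit. An agent that does not know the goal location can easily use up the whole episode without reaching it.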



DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

Neural Information Processing Systems

Pre-trained vision language models (VLMs), though powerful, typically lack training on decision-centric data, rendering them sub-optimal for decision-making tasks such as in-the-wild device control through Graphical User Interfaces (GUIs) when used off-the-shelf. While training with static demonstrations has shown some promise, we show that such methods fall short when controlling real GUIs due to their failure to deal with real-world stochasticity and dynamism not captured in static observational data. This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents through fine-tuning a pre-trained VLM in two stages: offline and offline-to-online RL. We first build a scalable and parallelizable Android learning environment equipped with a VLM-based general-purpose evaluator and then identify the key design choices for simple and effective RL in this domain. We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our 1.5B VLM trained with RL achieves a 49.5% absolute improvement (from 17.7% to 67.2% success rate) over supervised fine-tuning with static human demonstration data. It is worth noting that such improvement is achieved without any additional supervision or demonstration data. These results significantly surpass not only the prior best agents, including AppAgent with GPT-4V (8.3% success rate) and the 17B CogAgent trained with AitW data (14.4%),