MemER: Scaling Up Memory for Robot Control via Experience Retrieval

Ajay Sridhar, Jennifer Pan, Satvik Sharma, Chelsea Finn

arXiv.org Artificial Intelligence 

Humans routinely rely on memory to perform tasks, yet most robot policies lack this capability; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminately subsampling the history yields irrelevant or redundant context. We propose a hierarchical policy framework in which the high-level policy is trained to select and track relevant keyframes from its past experience. The high-level policy conditions on the selected keyframes and the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory. Videos and code can be found at https://jen-pan.github.io/memer/.

Recent years have seen significant strides in the language-following and generalization capabilities of robotic manipulation policies (Brohan et al., 2023; Intelligence et al., 2025; Kim et al., 2024; NVIDIA et al., 2025; Team et al., 2025). While these policies continue to improve for real-world deployment, a critical limitation remains: the absence of long-term memory. Memory allows humans to handle the partial observability inherent in their environment. For instance, a person making a sandwich must recall where they last saw the jar of peanut butter or the knife, especially if those items have not been viewed recently. The ability to form and retrieve memories is a crucial step toward robots solving complex, multi-step tasks.
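The hierarchical design described above can be sketched as a simple control loop: a high-level policy scores past frames for relevance, keeps the top few as keyframes, and emits a text instruction that a low-level policy executes. Everything below is a toy stand-in, not the paper's learned components: the feature vectors, the dot-product relevance score, and the instruction string are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    t: int              # timestep index
    features: tuple     # toy visual features (stand-in for a learned encoding)

def relevance(frame: Frame, current: Frame) -> float:
    # Toy relevance score: dot product of feature vectors. MemER instead
    # trains the high-level policy to select keyframes; this is illustrative.
    return sum(a * b for a, b in zip(frame.features, current.features))

def high_level_policy(history: list, recent: list, k: int = 2):
    """Select the k most relevant past keyframes given the latest observation,
    then emit a (toy) text instruction conditioned on keyframes + recent frames."""
    past = history[: len(history) - len(recent)]
    keyframes = sorted(past, key=lambda f: relevance(f, recent[-1]), reverse=True)[:k]
    instruction = f"revisit timesteps {sorted(f.t for f in keyframes)}"
    return keyframes, instruction

def low_level_policy(instruction: str, observation: Frame) -> str:
    # Placeholder for a VLA policy that executes the text instruction.
    return f"executing '{instruction}' at t={observation.t}"

if __name__ == "__main__":
    history = [Frame(0, (1, 0)), Frame(1, (0, 1)), Frame(2, (1, 1)), Frame(3, (0.5, 0.5))]
    recent = history[-1:]
    keyframes, instruction = high_level_policy(history, recent, k=2)
    print(low_level_policy(instruction, recent[-1]))
```

The point of the split is that only the high-level policy ever reasons over the long history; the low-level policy sees just a short instruction plus the current observation, which keeps its context length fixed regardless of episode duration.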