WorldGym: World Model as An Environment for Policy Evaluation

Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, Sherry Yang

arXiv.org Artificial Intelligence 

Robots can help humans in ways that range from home robots performing chores (Shafiullah et al.) […]

With the development of generative models trained on large-scale video data (Ho et al., 2022; Villegas et al., 2022; Singer et al., 2022), recent work has shown that video world models can visually emulate […]

Inspired by this observation, we propose a world-model-based policy evaluation environment (WorldGym), as shown in Figure 1. To enable efficient rollouts of policies that predict action chunks of different lengths, WorldGym aligns its diffusion horizon length with the policies' chunk sizes at inference time. Given video rollouts from the world model, WorldGym then uses a vision-language model (VLM) to determine task success. We then use the world model to evaluate VLA-based robot policies by rolling them out in the world model from real initial frames, and compare their success rates (policy values) in WorldGym to those achieved in real-world experiments. We propose flexibly aligning the diffusion horizon length with the policies' action chunk sizes for efficient rollouts.

We consider a multi-task, finite-horizon, partially observable Markov decision process (POMDP) (Puterman, 2014; Kaelbling et al., 1995), specified by […]

In this section, we first describe our implementation of world model training and inference.

See videos and code at https://world-model-eval.github.io
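The POMDP sentence above is truncated in this extraction; as a hedged reference point, the standard textbook specification (following Puterman, 2014; Kaelbling et al., 1995) is a tuple of the following form, though the paper's exact notation may differ:

```latex
% Generic finite-horizon POMDP tuple (textbook form, not necessarily
% the paper's notation):
\mathcal{M} = (\mathcal{S}, \mathcal{A}, \Omega, T, Z, R, H)
% \mathcal{S}: state space,  \mathcal{A}: action space,
% \Omega: observation space,
% T(s' \mid s, a): transition dynamics,
% Z(o \mid s'): observation model,
% R(s, a): reward function,  H: finite horizon.
```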
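The idea of aligning the diffusion horizon with the policy's action-chunk size can be sketched as follows. This is a minimal illustration under assumed interfaces: `world_model.generate(frames, actions, horizon)` and `policy.predict_chunk(obs)` are hypothetical names, not the paper's actual API.

```python
def rollout_in_world_model(world_model, policy, init_frame, max_steps=64):
    """Roll a policy out inside a video world model, matching the
    diffusion horizon to the policy's action-chunk size each step.

    Hypothetical interfaces (assumptions, not the paper's API):
      policy.predict_chunk(obs) -> list of actions (chunk size may vary)
      world_model.generate(frames, actions, horizon) -> list of new frames
    """
    frames = [init_frame]
    steps = 0
    while steps < max_steps:
        # The policy emits a chunk of k actions; k may differ per call
        # and per policy (e.g. VLAs with different chunk sizes).
        chunk = policy.predict_chunk(frames[-1])
        k = len(chunk)
        # Align the diffusion horizon with the chunk size so a single
        # world-model call consumes the entire chunk.
        new_frames = world_model.generate(frames, chunk, horizon=k)
        frames.extend(new_frames)
        steps += k
    return frames
```

The key design point is that the world model predicts exactly as many frames as the policy supplied actions, avoiding wasted generation or repeated short calls.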
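VLM-based success detection on a video rollout might look like the sketch below. The `vlm.ask(images, prompt)` interface is a placeholder assumption; real VLM APIs (and the paper's actual prompting setup) will differ.

```python
def vlm_task_success(vlm, frames, task_description, num_eval_frames=8):
    """Judge task success from a world-model video rollout with a VLM.

    `vlm.ask(images, prompt)` is a hypothetical interface returning a
    free-form string answer.
    """
    # Subsample evenly spaced frames so the query stays compact.
    stride = max(1, len(frames) // num_eval_frames)
    sampled = frames[::stride][:num_eval_frames]
    prompt = (
        f"Task: {task_description}\n"
        "Based on these frames from the robot rollout, was the task "
        "completed successfully? Answer yes or no."
    )
    answer = vlm.ask(sampled, prompt)
    # Treat any answer beginning with "yes" as success.
    return answer.strip().lower().startswith("yes")
```

In practice one would also want to handle ambiguous answers (e.g. by re-querying or majority-voting over several prompts).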
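Comparing policy values between WorldGym and the real world reduces to comparing empirical success rates, e.g. via a correlation coefficient. A self-contained sketch (Pearson correlation written out explicitly; in practice one might use `scipy.stats`):

```python
from statistics import mean

def success_rate(flags):
    """Policy value: empirical success rate over evaluation rollouts."""
    return sum(flags) / len(flags)

def pearson(xs, ys):
    """Pearson correlation between per-policy values, e.g. WorldGym
    success rates vs. real-world success rates."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A high correlation across policies indicates that relative rankings in the world model transfer to the real world, even if absolute success rates differ.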