Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report

Riccardo Mereu, Aidan Scannell, Yuxin Hou, Yi Zhao, Aditya Jitta, Antonio Dominguez, Luigi Acerbi, Amos Storkey, Paul Chang

Oct-9-2025–arXiv.org Artificial Intelligence

World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.
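For readers unfamiliar with the sampling-track metric, the following is a minimal sketch of how PSNR is conventionally computed between a predicted frame and a ground-truth frame. It is not the challenge's evaluation code: the function name `psnr`, the flat pixel-list input, and the 8-bit peak value of 255 are assumptions made for illustration.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    Assumption: pixels are given as flat sequences of numbers in [0, max_value].
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    # PSNR = 10 * log10(MAX^2 / MSE)
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # A uniform error of 18 intensity levels on a 0-255 scale.
    print(round(psnr([18, 18, 18, 18], [0, 0, 0, 0]), 2))
```

Under these assumptions, a root-mean-square error of about 18 intensity levels on a 0-255 scale corresponds to roughly 23 dB, which gives a sense of scale for the reported score.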

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Robots (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language (0.95)
  - Cognitive Science > Problem Solving (0.94)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)