467655d26fcc207bca08915dc91964c6-Paper-Conference.pdf

Jun-16-2026, 23:11:44 GMT–Neural Information Processing Systems

World models are generative systems that learn to predict an environment's response to actions, making them well suited for simulating complex, interactive settings [28, 2, 30, 74, 90]. Video diffusion models [11, 37, 44, 79, 55] have emerged as a powerful approach to building world models, especially when combined with autoregressive next-frame prediction [1, 12, 18, 22, 41, 53, 60, 65, 73, 81, 35]. Existing video generation models, however, often struggle with long-horizon consistency: their temporal context windows are limited, so they frequently forget previously seen scenes when revisiting them. The model can consider only a relatively small number of previously generated context frames when generating new frames, a restriction caused primarily by the quadratic growth of computational complexity in the attention module of the underlying diffusion transformers. To keep computation feasible, current world models simply keep the number of context frames low.
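
As a rough illustration of the quadratic growth in attention cost, the following Python sketch counts query-key pairs under full self-attention. It assumes every token attends to every other token across all context frames; the function name `attention_pairs` and the figure of 256 tokens per frame are hypothetical choices for illustration, not values from the paper.

```python
def attention_pairs(context_frames: int, tokens_per_frame: int) -> int:
    """Count query-key pairs when all frame tokens attend to each other."""
    n_tokens = context_frames * tokens_per_frame
    # Full self-attention scores every token against every token.
    return n_tokens * n_tokens


# Hypothetical token count per frame, chosen only for illustration.
TOKENS_PER_FRAME = 256

for frames in (8, 16, 32):
    pairs = attention_pairs(frames, TOKENS_PER_FRAME)
    print(f"{frames} context frames: {pairs} query-key pairs")
```

Under these assumptions, doubling the number of context frames quadruples the number of pairs, which is why long contexts quickly become expensive.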

Country:
- Asia (0.28)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Information Technology (0.93)
- Media
  - News (0.46)
  - Television (0.46)
  - Photography (0.46)
  - Film (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Cognitive Science (1.00)
  - Natural Language > Large Language Model (0.93)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)
