VideoTitans: Scalable Video Prediction with Integrated Short-and Long-term Memory
–Neural Information Processing Systems
Accurate video forecasting enables autonomous vehicles to anticipate hazards, robotics and surveillance systems to predict human intent, and environmental models to issue timely warnings for extreme weather events. However, existing methods remain limited: transformers rely on global attention with quadratic complexity, making them impractical for high-resolution, long-horizon video prediction, while convolutional and recurrent networks suffer from short-range receptive fields and vanishing gradients, losing key information over extended sequences. To overcome these challenges, we introduce VideoTitans, the first architecture to adapt the gradient-driven Titans memory--originally designed for language modelling to video prediction. VideoTitans integrates three core ideas: (i) a sliding-window attention core that scales linearly with sequence length and spatial resolution, (ii) an episodic memory that dynamically retains only informative tokens based on a gradient-based surprise signal, and (iii) a small set of persistent tokens encoding task-specific priors that stabilize training and enhance generalization.
Neural Information Processing Systems
Jun-19-2026, 19:01:53 GMT
- Genre:
- Overview (0.68)
- Research Report
- Experimental Study (1.00)
- New Finding (0.68)
- Industry:
- Information Technology > Security & Privacy (0.48)
- Health & Medicine > Consumer Health (0.34)
- Technology:
- Information Technology
- Data Science > Data Mining (0.93)
- Artificial Intelligence
- Vision (1.00)
- Robots (1.00)
- Representation & Reasoning (1.00)
- Natural Language (1.00)
- Cognitive Science (1.00)
- Machine Learning > Neural Networks
- Deep Learning (1.00)
- Information Technology