VideoMAR: Autoregressive Video Generation with Continuous Tokens
Neural Information Processing Systems
Mask-based autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. In this paper, we propose VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens that combines temporal frame-by-frame generation with spatial masked generation. We first identify temporal causality and spatial bi-directionality as the first principles of video AR models, and propose a next-frame diffusion loss to integrate masked generation with video generation. In addition, the high cost and difficulty of long-sequence autoregressive modeling is a basic but crucial issue. To address it, we propose temporal short-to-long curriculum learning and spatial progressive-resolution training, and employ a progressive temperature strategy at inference time to mitigate error accumulation.
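One common way to express "temporal causality with spatial bi-directionality" is a frame-wise block-causal attention mask: every token may attend to all tokens in its own frame and in earlier frames, but to none in later frames. The sketch below illustrates that idea only. The function name `block_causal_mask` and its parameters are hypothetical, and the abstract does not say that VideoMAR builds its mask this way.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Return a square boolean mask as a list of lists.

    mask[i][j] is True when token i is allowed to attend to token j.
    Tokens are assumed to be laid out frame by frame, so token index
    // tokens_per_frame gives the frame index (an illustrative layout,
    not one confirmed by the source).
    """
    total = num_frames * tokens_per_frame
    mask = []
    for i in range(total):
        frame_i = i // tokens_per_frame
        # Bidirectional inside a frame, causal across frames.
        row = [(j // tokens_per_frame) <= frame_i for j in range(total)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two frames of two tokens each; 1 means attention is allowed.
    for row in block_causal_mask(2, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

With two frames of two tokens each, tokens of the first frame see only the first frame, while tokens of the second frame see both frames.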
Jun-17-2026, 07:16:58 GMT
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Health & Medicine (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Machine Learning > Neural Networks (1.00)
- Natural Language > Large Language Model (0.94)
- Representation & Reasoning (0.93)