AITopics | video generation

video generation

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

Neural Information Processing Systems | Jun-23-2026, 06:57:43 GMT

Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of egocentric viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses 5 million egocentric video clips and is enriched with detailed action annotations, including 5M high-level textual descriptions and 65K fine-grained kinematic control annotations. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleaning pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.

artificial intelligence, data quality, video generation, (16 more...)

Neural Information Processing Systems

Country: Asia (0.14)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Human Computer Interaction > Interfaces > Virtual Reality (0.86)
Information Technology > Data Science > Data Quality > Data Cleaning (0.54)

Image as a World: Generating Interactive World from Single Image via Panoramic Video Generation

Neural Information Processing Systems | Jun-23-2026, 04:00:09 GMT

Generating an interactive visual world from a single image is both challenging and practically valuable, as single-view inputs are easy to acquire and align well with prompt-driven applications such as gaming and virtual reality. This paper introduces a novel unified framework, Image as a World (IaaW), which synthesizes high-quality 360-degree videos from a single image that are both controllable and temporally continuable.

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Fine Temporal Preference Optimization for Video Diffusion Models

Neural Information Processing Systems | Jun-23-2026, 03:52:18 GMT

Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
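The idea of labeling preferences on short segments rather than whole clips can be illustrated with a minimal sketch. This is not the DenseDPO objective from the paper, whose exact formulation the abstract does not give; it is a generic DPO-style logistic loss applied per segment. The names `dpo_loss` and `segment_dpo_loss`, the per-segment `margins` (reward of video A minus video B), and the tie label `0` are all assumptions made for illustration.

```python
import math


def dpo_loss(margin: float, beta: float = 1.0) -> float:
    """Generic DPO-style loss: -log(sigmoid(beta * margin)).

    `margin` is the implicit reward of the preferred sample minus that of
    the rejected one. Computed as softplus(-x) in a numerically stable way.
    """
    x = beta * margin
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def segment_dpo_loss(margins, labels, beta: float = 1.0) -> float:
    """Average the loss over temporally aligned segments (hypothetical helper).

    margins[i]: reward of video A minus video B on segment i (assumed given).
    labels[i]:  +1 if A is preferred on segment i, -1 if B is, 0 for a tie.
    Tied segments contribute no learning signal and are skipped.
    """
    losses = [
        dpo_loss(label * margin, beta)
        for margin, label in zip(margins, labels)
        if label != 0
    ]
    return sum(losses) / len(losses) if losses else 0.0
```

With clip-level labels, a pair yields a single preference; here one pair yields one term per labeled segment, which is the sense in which the signal becomes denser.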

artificial intelligence, machine learning, video, (18 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Leisure & Entertainment (0.67)
Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation

Neural Information Processing Systems | Jun-23-2026, 03:29:22 GMT

We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
(2 more...)

PhysCtrl [title garbled in extraction; recoverable figure labels: Input Image & Force, Generated Trajectories and Video]

Neural Information Processing Systems | Jun-23-2026, 03:02:15 GMT

Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology (1.00)
Media (0.68)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Neural Information Processing Systems | Jun-23-2026, 02:51:03 GMT

We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models.
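The autoregressive rollout with a rolling KV cache described above can be sketched as a toy example. This is only an illustration of the general pattern (condition each frame on cached, self-generated context, and evict the oldest entries once a fixed window is full), not the paper's implementation. `RollingKVCache`, `rollout`, and the `generate_frame` callback signature are hypothetical names introduced here.

```python
from collections import deque


class RollingKVCache:
    """Fixed-size cache of per-frame keys and values (toy illustration).

    Once `max_frames` entries are stored, appending a new frame evicts the
    oldest one, so the context window length stays bounded.
    """

    def __init__(self, max_frames: int):
        self._keys = deque(maxlen=max_frames)
        self._values = deque(maxlen=max_frames)

    def append(self, key, value) -> None:
        self._keys.append(key)
        self._values.append(value)

    def context(self):
        """Return the cached keys and values, oldest first."""
        return list(self._keys), list(self._values)

    def __len__(self) -> int:
        return len(self._keys)


def rollout(generate_frame, num_frames: int, max_frames: int):
    """Generate frames one by one, each conditioned on the model's own outputs.

    `generate_frame(t, keys, values)` is an assumed callback that returns
    (frame, key, value) for step t given the cached context.
    """
    cache = RollingKVCache(max_frames)
    frames = []
    for t in range(num_frames):
        keys, values = cache.context()
        frame, key, value = generate_frame(t, keys, values)
        cache.append(key, value)  # cache the self-generated frame's entry
        frames.append(frame)
    return frames
```

For example, a stand-in generator that reports how much context it saw, `rollout(lambda t, k, v: (len(k), t, t), 5, 2)`, sees the context grow to the window size and then stay constant, which is what makes long extrapolation cost bounded per step.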

diffusion model, large language model, machine learning, (18 more...)

Neural Information Processing Systems

Country: Asia (0.46)

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

Neural Information Processing Systems | Jun-23-2026, 02:29:27 GMT

We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (0.67)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Graphics (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

OPENS2V-NEXUS: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Neural Information Processing Systems | Jun-22-2026, 22:43:42 GMT

Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OPENS2V-NEXUS, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench [38] that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore, and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 18 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image on raw frames to synthesize multi-view representations. Through OPENS2V-NEXUS, we deliver a robust infrastructure to accelerate future S2V generation research.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Industry:

Information Technology (1.00)
Energy (0.67)
Law (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

PlayerOne: Egocentric World Simulator (d607a260b3bc0b9a704a1a04dd64040a-Paper-Conference.pdf)

Neural Information Processing Systems | Jun-22-2026, 21:57:46 GMT

We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real-scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology (0.93)
Leisure & Entertainment > Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Human Computer Interaction (0.93)
(2 more...)

GPDiT [title lost in extraction; recoverable figure labels: Clean Frame, Denoised Frame, High Level to Low Level, Low Level to High Level, Style Transfer, Video Generation, Few-Shot Learning]

Neural Information Processing Systems | Jun-22-2026, 17:06:43 GMT

Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semantic consistency across frames. This continuous autoregressive framework not only enhances generation quality but also endows the model with representation capabilities. Additionally, we introduce a lightweight causal attention variant and a parameter-free rotation-based time-conditioning mechanism, improving both the training and inference efficiency. Extensive experiments demonstrate that GPDiT achieves strong performance in video generation quality, video representation ability, and few-shot learning tasks, highlighting its potential as an effective framework for video modeling in continuous space.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

Neural Information Processing Systems

Genre: