AITopics | future frame

Collaborating Authors

future frame

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Neural Information Processing SystemsJun-17-2026, 18:16:19 GMT

Vision-Language-Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains-of-thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compressions can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model to generate a unified future frame that overlays coarse but physically-plausible priors--future lane dividers and 3D boxes--on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry:

Transportation > Ground > Road (0.53)
Automobiles & Trucks (0.53)
Energy (0.46)
Information Technology > Robotics & Automation (0.44)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

Taming generative video models for zero-shot optical flow extraction

Neural Information Processing SystemsJun-14-2026, 02:04:46 GMT

Extracting optical flow from videos remains a core computer vision problem. Motivated by the recent success of large general-purpose models, we ask whether frozen self-supervised video models trained only to predict future frames can be prompted, without fine-tuning, to output flow. Prior attempts to read out depth or illumination from video generators required fine-tuning; that strategy is ill-suited for flow, where labeled data is scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models for zero-shot flow extraction. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recently introduced Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time inference procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method is competitive with state-of-the-art, task-specific models on the real-world TAP-Vid DAVIS benchmark and the synthetic TAP-Vid Kubric. Our results show that counterfactual prompting of controllable generative video models is an effective alternative to supervised or photometric-loss methods for high-quality flow.

large language model, machine learning, natural language, (9 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.58)

Technology:

Information Technology > Artificial Intelligence > Vision (0.60)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.58)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.51)
Information Technology > Artificial Intelligence > Machine Learning (0.38)

Add feedback

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Neural Information Processing SystemsJun-12-2026, 11:05:30 GMT

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.35)

Add feedback

Probabilistic Modeling of Future Frames from a Single Image

Tianfan Xue, Jiajun Wu, Katherine Bouman, Bill Freeman

Neural Information Processing SystemsMar-22-2026, 23:50:00 GMT

We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods, which have tackled this problem in a deterministic or non-parametric way, we propose to model future frames in a probabilistic manner. Our probabilistic model makes it possible for us to sample and synthesize many possible future frames from a single input image. To synthesize realistic movement of objects, we propose a novel network structure, namely a Cross Convolutional Network; this network encodes image and motion information as feature maps and convolutional kernels, respectively. In experiments, our model performs well on synthetic data, such as 2D shapes and animated game sprites, as well as on real-world video frames. We also show that our model can be applied to visual analogy-making, and present an analysis of the learned network representations.

artificial intelligence, future frame, machine learning, (19 more...)

Neural Information Processing Systems

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks

Neural Information Processing SystemsMar-17-2026, 06:33:25 GMT

We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods, which have tackled this problem in a deterministic or non-parametric way, we propose a novel approach which models future frames in a probabilistic manner. Our proposed method is therefore able to synthesize multiple possible next frames using the same model. Solving this challenging problem involves low-and high-level image and motion understanding for successful image synthesis. Here, we propose a novel network structure, namely a Cross Convolutional Network, that encodes images as feature maps and motion information as convolutional kernels to aid in synthesizing future frames. In experiments, our model performs well on both synthetic data, such as 2D shapes and animated game sprites, as well as on real-wold video data. We show that our model can also be applied to tasks such as visual analogy-making, and present analysis of the learned network representations.

artificial intelligence, machine learning, proceedings, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.41)

Add feedback

Siamese Masked Autoencoders

Neural Information Processing SystemsFeb-15-2026, 12:44:53 GMT

SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.

artificial intelligence, machine learning, natural language, (11 more...)

Neural Information Processing Systems

Country:

Europe > Netherlands > North Holland > Amsterdam (0.05)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Add feedback

Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction

Yunji Kim, Seonghyeon Nam, In Cho, Seon Joo Kim

Neural Information Processing SystemsFeb-12-2026, 17:57:18 GMT

Neural Information Processing Systems http://nips.cc/

dataset, keypoint, video, (13 more...)

Neural Information Processing Systems

Country: North America > Canada (0.04)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

MCVD - Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation

Neural Information Processing SystemsFeb-5-2026, 17:00:30 GMT

Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames.

artificial intelligence, machine learning, masked conditional video diffusion, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.96)

Add feedback

Siamese Masked Autoencoders

Neural Information Processing SystemsDec-26-2025, 05:48:53 GMT

Establishing correspondence between images or scenes is a significant challenge in computer vision, especially given occlusions, viewpoint changes, and varying object appearances.

electronic proceedings, name change, siamese masked autoencoder, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (0.58)

Add feedback

Distilling Future Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection

Zheng, Haowen, Zhu, Hu, Deng, Lu, Gu, Weihao, Yang, Yang, Liang, Yanyan

arXiv.org Artificial IntelligenceDec-10-2025

Camera-based temporal 3D object detection has shown impressive results in autonomous driving, with offline models improving accuracy by using future frames. Knowledge distillation (KD) can be an appealing framework for transferring rich information from offline models to online models. However, existing KD methods overlook future frames, as they mainly focus on spatial feature distillation under strict frame alignment or on temporal relational distillation, thereby making it challenging for online models to effectively learn future knowledge. To this end, we propose a sparse query-based approach, Future Temporal Knowledge Distillation (FTKD), which effectively transfers future frame knowledge from an offline teacher model to an online student model. Specifically, we present a future-aware feature reconstruction strategy to encourage the student model to capture future features without strict frame alignment. In addition, we further introduce future-guided logit distillation to leverage the teacher's stable foreground and background context. FTKD is applied to two high-performing 3D object detection baselines, achieving up to 1.3 mAP and 1.3 NDS gains on the nuScenes dataset, as well as the most accurate velocity estimation, without increasing inference cost.

artificial intelligence, distillation, temporal reasoning, (17 more...)

arXiv.org Artificial Intelligence

2512.08247

Genre: Research Report (0.82)

Industry:

Education (1.00)
Transportation > Ground > Road (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning > Temporal Reasoning (0.62)

Add feedback