FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Jun-17-2026, 18:16:19 GMT–Neural Information Processing Systems

Vision-Language-Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains-of-thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compressions can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model to generate a unified future frame that overlays coarse but physically-plausible priors--future lane dividers and 3D boxes--on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Jun-17-2026, 18:16:19 GMT

Conferences PDF

Add feedback

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Automobiles & Trucks (0.53)
- Energy (0.46)
- Information Technology > Robotics & Automation (0.44)
- Transportation > Ground
  - Road (0.53)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (0.96)
  - Machine Learning > Neural Networks (0.94)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found