FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Jun-12-2026, 11:05:30 GMT–Neural Information Processing Systems

Vision-Language-Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains-of-thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compressions can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model to generate a unified future frame that overlays coarse but physically-plausible priors--future lane dividers and 3D boxes--on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution.

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Jun-12-2026, 11:05:30 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.35)