FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
Neural Information Processing Systems
Vision-Language-Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains-of-thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compressions can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model to generate a unified future frame that overlays coarse but physically plausible priors (future lane dividers and 3D boxes) on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution.
Jun-17-2026, 18:16:19 GMT
- Genre:
  - Research Report > Experimental Study (1.00)
- Industry:
  - Automobiles & Trucks (0.53)
  - Energy (0.46)
  - Information Technology > Robotics & Automation (0.44)
  - Transportation > Ground
    - Road (0.53)
- Technology:
  - Information Technology > Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Natural Language > Large Language Model (0.96)
    - Machine Learning > Neural Networks (0.94)