[Figure: panels (a)-(d) comparing VLA designs. Recoverable labels: Instruction, Image, Action, Policy, VLA, Image/Video Generation, World Embedding, Dream Queries.]
Neural Information Processing Systems
Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial, and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning.
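The perception-prediction-action loop and the role of inverse dynamics can be illustrated with a toy sketch. This is not DreamVLA's implementation: the 2-D state, the additive dynamics `next = current + action`, and the names `Observation`, `predict_world_embedding`, `inverse_dynamics`, and `run_loop` are all assumptions made for illustration. The sketch only shows the general pattern of first forecasting a compact future state and then inferring the action that connects the current state to that forecast.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class Observation:
    """Toy stand-in for the perceived scene: a 2-D gripper position and a goal."""
    position: Vector
    goal: Vector


def predict_world_embedding(obs: Observation, step_fraction: float = 0.5) -> Vector:
    """Hypothetical prediction stage: forecast a compact future state.

    Here the forecast is simply a point part-way from the current
    position toward the goal (an assumption, not a learned model).
    """
    return [p + step_fraction * (g - p) for p, g in zip(obs.position, obs.goal)]


def inverse_dynamics(current: Vector, predicted_future: Vector) -> Vector:
    """Toy inverse dynamics under the assumed dynamics next = current + action.

    The action that realizes the predicted future is the difference
    between the predicted future state and the current state.
    """
    return [f - c for c, f in zip(current, predicted_future)]


def run_loop(obs: Observation, steps: int) -> Observation:
    """Run perceive -> predict -> act for a fixed number of steps."""
    for _ in range(steps):
        future = predict_world_embedding(obs)            # prediction
        action = inverse_dynamics(obs.position, future)  # action from inverse dynamics
        new_position = [p + a for p, a in zip(obs.position, action)]
        obs = Observation(position=new_position, goal=obs.goal)  # next perception
    return obs
```

Calling `run_loop(Observation([0.0, 0.0], [4.0, 2.0]), steps=2)` moves the toy gripper halfway to the goal on each step. In a learned system, both the forecast and the inverse dynamics mapping would be produced by trained networks rather than closed-form arithmetic.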
Jun-15-2026, 15:36:36 GMT
- Country:
  - Europe (1.00)
  - North America > United States (0.46)
- Genre:
  - Research Report > Experimental Study (0.93)
- Technology:
  - Information Technology > Artificial Intelligence
    - Vision (1.00)
    - Robots (1.00)
    - Representation & Reasoning > Spatial Reasoning (1.00)
    - Natural Language > Large Language Model (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)