4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration

Neural Information Processing Systems

Leveraging diverse robotic data for pretraining remains a critical challenge.