ViSTa Dataset: Do vision-language models understand sequential tasks?