ViSTa Dataset: Do vision-language models understand sequential tasks?

Open in new window