Vision Language Models Cannot Plan, but Can They Formalize?
