Why are Visually-Grounded Language Models Bad at Image Classification?
