Do better language models have crisper vision?