Improving Fine-grained Visual Understanding in VLMs through Text-Only Training