Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models