Making Large Multimodal Models Understand Arbitrary Visual Prompts