Learning Multimodal LLMs without Text-only Forgetting