Learning Multimodal LLMs without Text-only Forgetting

Yang Li

Neural Information Processing Systems 

Multimodal large language models (MLLMs), initialized from a trained LLM, are first trained to align images with text and then fine-tuned on mixed multimodal inputs. However, during this continued training, the MLLM catastrophically forgets the text-only instruction-following ability that the initial LLM had mastered.
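Below is a minimal, self-contained sketch (with toy stand-in modules, not the paper's actual architecture or method) of the two-stage pipeline the abstract describes: stage 1 aligns images with text by training only a vision projector while the LLM stays frozen; stage 2 fine-tunes the whole model on mixed multimodal inputs. The text-only probe before and after illustrates how forgetting of the initial LLM's text skill can be measured. All module sizes, data, and names here are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D_TEXT, D_VISION, VOCAB = 64, 32, 1000

class ToyMLLM(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D_TEXT, nhead=4, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)   # the "trained LLM"
        self.projector = nn.Linear(D_VISION, D_TEXT)            # vision-to-text bridge
        self.lm_head = nn.Linear(D_TEXT, VOCAB)

    def forward(self, text_emb, vision_feats=None):
        if vision_feats is not None:
            # Prepend projected image tokens to the text sequence.
            text_emb = torch.cat([self.projector(vision_feats), text_emb], dim=1)
        return self.lm_head(self.llm(text_emb))

def train_step(model, params, text, vision, targets):
    # One gradient step over the given parameter subset.
    opt = torch.optim.AdamW(params, lr=1e-4)
    logits = model(text, vision)
    loss = nn.functional.cross_entropy(
        logits[:, -targets.size(1):].reshape(-1, VOCAB), targets.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def text_only_loss(model, text, targets):
    # Probe: loss on text-only inputs, no image tokens involved.
    with torch.no_grad():
        logits = model(text)
        return nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB), targets.reshape(-1)).item()

model = ToyMLLM()
text = torch.randn(2, 8, D_TEXT)           # toy text embeddings
vision = torch.randn(2, 4, D_VISION)       # toy image features
targets = torch.randint(0, VOCAB, (2, 8))  # toy next-token targets

base = text_only_loss(model, text, targets)  # baseline before multimodal training

# Stage 1: align images with text -- update only the projector, LLM frozen.
for p in model.llm.parameters():
    p.requires_grad_(False)
train_step(model, model.projector.parameters(), text, vision, targets)

# Stage 2: fine-tune everything on multimodal mixed inputs.
for p in model.llm.parameters():
    p.requires_grad_(True)
train_step(model, model.parameters(), text, vision, targets)

# A rise relative to `base` indicates text-only forgetting.
after = text_only_loss(model, text, targets)
print(f"text-only loss before: {base:.3f}, after: {after:.3f}")
```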
