A Survey of Resource-efficient LLM and Multimodal Foundation Models

Xu, Mengwei, Yin, Wangsong, Cai, Dongqi, Yi, Rongjie, Xu, Daliang, Wang, Qipeng, Wu, Bingyang, Zhao, Yihao, Yang, Chen, Wang, Shihe, Zhang, Qiyang, Lu, Zhenyan, Zhang, Li, Wang, Shangguang, Li, Yuanchun, Liu, Yunxin, Jin, Xin, Liu, Xuanzhe

arXiv.org Artificial Intelligence 

In the rapidly evolving field of artificial intelligence (AI), a paradigm shift is underway. We are witnessing the transition from specialized, fragmented deep learning models to versatile, one-size-fits-all foundation models. These advanced AI systems can operate in an open-world context, interacting with open vocabularies and image pixels to handle unseen AI tasks, i.e., they exhibit zero-shot abilities. They are exemplified by (1) Large Language Models (LLMs) such as GPTs [39], which can ingest almost every NLP task in the form of a prompt; (2) Vision Transformer models (ViTs) such as the Masked Autoencoder [133], which can handle various downstream vision tasks; (3) Latent Diffusion Models (LDMs) such as Stable Diffusion [310], which generate high-quality images from arbitrary text prompts; and (4) multimodal models such as CLIP [296] and ImageBind [116], which map data of different modalities into the same latent space and are widely used as backbones for cross-modality tasks such as image retrieval/search and visual question answering. Such flexibility and generality mark a significant departure from the earlier era of AI and set a new standard for how AI interfaces with the world. The success of these foundation models is deeply rooted in their scalability: unlike their predecessors, their accuracy and generalization ability continue to improve with more data or parameters, without altering the underlying simple algorithms and architectures.