Singhal, Saksham
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Microsoft, Abouelenin, Abdelrahman, Ashfaq, Atabak, Atkinson, Adam, Awadalla, Hany, Bach, Nguyen, Bao, Jianmin, Benhaim, Alon, Cai, Martin, Chaudhary, Vishrav, Chen, Congcong, Chen, Dong, Chen, Dongdong, Chen, Junkun, Chen, Weizhu, Chen, Yen-Chun, Chen, Yi-ling, Dai, Qi, Dai, Xiyang, Fan, Ruchao, Gao, Mei, Gao, Min, Garg, Amit, Goswami, Abhishek, Hao, Junheng, Hendy, Amr, Hu, Yuxuan, Jin, Xin, Khademi, Mahmoud, Kim, Dongwoo, Kim, Young Jin, Lee, Gina, Li, Jinyu, Li, Yunsheng, Liang, Chen, Lin, Xihui, Lin, Zeqi, Liu, Mengchen, Liu, Yang, Lopez, Gilsinia, Luo, Chong, Madan, Piyush, Mazalov, Vadim, Mitra, Arindam, Mousavi, Ali, Nguyen, Anh, Pan, Jing, Perez-Becker, Daniel, Platin, Jacob, Portet, Thomas, Qiu, Kai, Ren, Bo, Ren, Liliang, Roy, Sambuddha, Shang, Ning, Shen, Yelong, Singhal, Saksham, Som, Subhojit, Song, Xia, Sych, Tetyana, Vaddamanu, Praneetha, Wang, Shuohang, Wang, Yiming, Wang, Zhenghao, Wu, Haibin, Xu, Haoran, Xu, Weijian, Yang, Yifan, Yang, Ziyi, Yu, Donghan, Zabir, Ishmam, Zhang, Jianwen, Zhang, Li Lyna, Zhang, Yunan, Zhou, Xiren
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first in the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment to further train Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.
On The Adaptation of Unlimiformer for Decoder-Only Transformers
Ahrabian, Kian, Benhaim, Alon, Patra, Barun, Pujara, Jay, Singhal, Saksham, Song, Xia
One of the prominent issues stifling the current generation of large language models is their limited context length. Recent proprietary models such as GPT-4 and Claude 2 have introduced longer context lengths, 8k/32k and 100k, respectively; however, despite the efforts in the community, most common models, such as Llama-2, have a context length of 4k or less. Unlimiformer (Bertsch et al., 2023) is a recently popular vector-retrieval augmentation method that offloads cross-attention computations to a kNN index. However, its main limitation is incompatibility with decoder-only transformers out of the box. In this work, we explore practical considerations of adapting Unlimiformer to decoder-only transformers and introduce a series of modifications to overcome this limitation. Moreover, we expand the original experimental setup on summarization to include a new task (i.e., free-form Q&A) and an instruction-tuned model (i.e., a custom 6.7B GPT model). Our results showcase the effectiveness of these modifications on summarization, performing on par with a model with 2x the context length. Moreover, we discuss limitations and future directions for free-form Q&A and instruction-tuned models.
Language Is Not All You Need: Aligning Perception with Language Models
Huang, Shaohan, Dong, Li, Wang, Wenhui, Hao, Yaru, Singhal, Saksham, Ma, Shuming, Lv, Tengchao, Cui, Lei, Mohammed, Owais Khan, Patra, Barun, Liu, Qiang, Aggarwal, Kriti, Chi, Zewen, Bjorck, Johan, Chaudhary, Vishrav, Som, Subhojit, Song, Xia, Wei, Furu
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.