Liang, Qiao
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
Liang, Qiao, Liu, Yanjiang, He, Ben, Lu, Yaojie, Lin, Hongyu, Zheng, Jia, Han, Xianpei, Sun, Le, Sun, Yingfei
Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of vision encoder's prior knowledge is seldom investigated. In this work, we introduce a novel metric, $Rank_e$, to quantify the effect of the vision encoder's prior knowledge on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient--particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases
Tang, Qiaoyu, Deng, Ziliang, Lin, Hongyu, Han, Xianpei, Liang, Qiao, Cao, Boxi, Sun, Le
Enabling large language models to utilize real-world tools effectively is crucial for achieving embodied intelligence. Existing approaches to tool learning have either primarily relied on extremely large language models, such as GPT-4, to attain generalized tool-use abilities in a zero-shot manner, or utilized supervised learning to train limited scopes of tools on compact models. However, it remains uncertain whether smaller language models can achieve generalized tool-use abilities without tool-specific training. To address this question, this paper introduces ToolAlpaca, a novel framework designed to automatically generate a diverse tool-use corpus and learn generalized tool-use abilities on compact language models with minimal human intervention. Specifically, ToolAlpaca first automatically creates a highly diversified tool-use corpus by building a multi-agent simulation environment. The corpus contains 3938 tool-use instances from more than 400 real-world tool APIs spanning 50 distinct categories. Subsequently, the constructed corpus is employed to fine-tune compact language models, resulting in two models, namely ToolAlpaca-7B and ToolAlpaca-13B, respectively. Finally, we evaluate the ability of these models to utilize previously unseen tools without specific training. Experimental results demonstrate that ToolAlpaca achieves effective generalized tool-use capabilities comparable to those of extremely large language models like GPT-3.5, demonstrating that learning generalized tool-use ability is feasible for compact language models.
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
Shen, Jonathan, Nguyen, Patrick, Wu, Yonghui, Chen, Zhifeng, Chen, Mia X., Jia, Ye, Kannan, Anjuli, Sainath, Tara, Cao, Yuan, Chiu, Chung-Cheng, He, Yanzhang, Chorowski, Jan, Hinsu, Smit, Laurenzo, Stella, Qin, James, Firat, Orhan, Macherey, Wolfgang, Gupta, Suyog, Bapna, Ankur, Zhang, Shuyuan, Pang, Ruoming, Weiss, Ron J., Prabhavalkar, Rohit, Liang, Qiao, Jacob, Benoit, Liang, Bowen, Lee, HyoukJoong, Chelba, Ciprian, Jean, Sébastien, Li, Bo, Johnson, Melvin, Anil, Rohan, Tibrewal, Rajat, Liu, Xiaobing, Eriguchi, Akiko, Jaitly, Navdeep, Ari, Naveen, Cherry, Colin, Haghani, Parisa, Good, Otavio, Cheng, Youlong, Alvarez, Raziel, Caswell, Isaac, Hsu, Wei-Ning, Yang, Zongheng, Wang, Kuan-Chieh, Gonina, Ekaterina, Tomanek, Katrin, Vanik, Ben, Wu, Zelin, Jones, Llion, Schuster, Mike, Huang, Yanping, Chen, Dehao, Irie, Kazuki, Foster, George, Richardson, John, Macherey, Klaus, Bruguier, Antoine, Zen, Heiga, Raffel, Colin, Kumar, Shankar, Rao, Kanishka, Rybach, David, Murray, Matthew, Peddinti, Vijayaditya, Krikun, Maxim, Bacchiani, Michiel A. U., Jablin, Thomas B., Suderman, Rob, Williams, Ian, Lee, Benjamin, Bhatia, Deepti, Carlson, Justin, Yavuz, Semih, Zhang, Yu, McGraw, Ian, Galkin, Max, Ge, Qi, Pundak, Golan, Whipkey, Chad, Wang, Todd, Alon, Uri, Lepikhin, Dmitry, Tian, Ye, Sabour, Sara, Chan, William, Toshniwal, Shubham, Liao, Baohua, Nirschl, Michael, Rondon, Pat
Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been used in collaboration by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.