AITopics | Liu, Chris

Collaborating Authors

Liu, Chris

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Gao, Peng, Zhang, Renrui, Liu, Chris, Qiu, Longtian, Huang, Siyuan, Lin, Weifeng, Zhao, Shitian, Geng, Shijie, Lin, Ziyi, Jin, Peng, Zhang, Kaipeng, Shao, Wenqi, Xu, Chao, He, Conghui, He, Junjun, Shao, Hao, Lu, Pan, Li, Hongsheng, Qiao, Yu

arXiv.org Artificial IntelligenceFeb-8-2024

We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

large language model, machine learning, sphinx-x, (14 more...)

arXiv.org Artificial Intelligence

2402.05935

Country: Europe (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Lin, Ziyi, Liu, Chris, Zhang, Renrui, Gao, Peng, Qiu, Longtian, Xiao, Han, Qiu, Han, Lin, Chen, Shao, Wenqi, Chen, Keqin, Han, Jiaming, Huang, Siyuan, Zhang, Yichi, He, Xuming, Li, Hongsheng, Qiao, Yu

arXiv.org Artificial IntelligenceNov-13-2023

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.

large language model, natural language, visual embedding, (5 more...)

arXiv.org Artificial Intelligence

2311.07575

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

ImageBind-LLM: Multi-modality Instruction Tuning

Han, Jiaming, Zhang, Renrui, Shao, Wenqi, Gao, Peng, Xu, Peng, Xiao, Han, Zhang, Kaipeng, Liu, Chris, Wen, Song, Guo, Ziyu, Lu, Xudong, Ren, Shuai, Wen, Yafei, Chen, Xiaoxin, Yue, Xiangyu, Li, Hongsheng, Qiao, Yu

arXiv.org Artificial IntelligenceSep-11-2023

We present ImageBind-LLM, a multi-modality instruction tuning method of large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning, different from which, our ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic by only image-text alignment training. During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder. Then, the image features transformed by the bind network are added to word tokens of all layers in LLaMA, which progressively injects visual instructions via an attention-free and zero-initialized gating mechanism. Aided by the joint embedding of ImageBind, the simple image-text training enables our model to exhibit superior multi-modality instruction-following capabilities. During inference, the multi-modality inputs are fed into the corresponding ImageBind encoders, and processed by a proposed visual cache model for further cross-modal embedding enhancement. The training-free cache model retrieves from three million image features extracted by ImageBind, which effectively mitigates the training-inference modality discrepancy. Notably, with our approach, ImageBind-LLM can respond to instructions of diverse modalities and demonstrate significant language generation quality. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2309.03905

Country: Asia > China (0.46)

Genre: Research Report (0.40)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Zhang, Renrui, Han, Jiaming, Liu, Chris, Gao, Peng, Zhou, Aojun, Hu, Xiangfei, Yan, Shilin, Lu, Pan, Li, Hongsheng, Qiao, Yu

arXiv.org Artificial IntelligenceJun-14-2023

Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts, and prepend them to the word tokens at higher transformer layers. Then, a zero-initialized attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA, while effectively preserves its pre-trained knowledge. With our efficient training, LLaMA-Adapter can generate high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Besides language commands, our approach can be simply extended to multi-modal instructions for learning image-conditioned LLaMA model, which achieves superior reasoning performance on ScienceQA and COCO Caption benchmarks. Furthermore, we also evaluate the zero-initialized attention mechanism for fine-tuning other pre-trained models (ViT, RoBERTa) on traditional vision and language tasks, demonstrating the superior generalization capacity of our approach.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2303.16199

Country:

North America > Mexico (0.69)
North America > Canada > Newfoundland and Labrador (0.14)
North America > United States > California (0.14)

Genre:

Research Report (0.50)
Personal (0.47)

Industry: Government > Regional Government > North America Government (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)

Add feedback