PaLI-X: On Scaling up a Multilingual Vision and Language Model

Chen, Xi, Djolonga, Josip, Padlewski, Piotr, Mustafa, Basil, Changpinyo, Soravit, Wu, Jialin, Ruiz, Carlos Riquelme, Goodman, Sebastian, Wang, Xiao, Tay, Yi, Shakeri, Siamak, Dehghani, Mostafa, Salz, Daniel, Lucic, Mario, Tschannen, Michael, Nagrani, Arsha, Hu, Hexiang, Joshi, Mandar, Pang, Bo, Montgomery, Ceslee, Pietrzyk, Paulina, Ritter, Marvin, Piergiovanni, AJ, Minderer, Matthias, Pavetic, Filip, Waters, Austin, Li, Gang, Alabdulmohsin, Ibrahim, Beyer, Lucas, Amelot, Julien, Lee, Kenton, Steiner, Andreas Peter, Li, Yang, Keysers, Daniel, Arnab, Anurag, Xu, Yuanzhong, Rong, Keran, Kolesnikov, Alexander, Seyedhosseini, Mojtaba, Angelova, Anelia, Zhai, Xiaohua, Houlsby, Neil, Soricut, Radu

May-29-2023–arXiv.org Artificial Intelligence

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.

large language model, machine learning, question answering, (20 more...)

arXiv.org Artificial Intelligence

May-29-2023

arXiv.org PDF

Add feedback

Country:
- South America > Argentina (0.04)
- Oceania > Australia (0.04)
- Europe > France (0.04)
- North America > United States
  - New York > New York County
    - New York City (0.04)
  - Louisiana > Orleans Parish
    - New Orleans (0.04)
  - California > Los Angeles County
    - Long Beach (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report (1.00)

Industry:
- Health & Medicine (0.92)
- Media (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Vision > Image Understanding (0.46)
  - Natural Language
    - Large Language Model (0.68)
    - Question Answering (0.55)
  - Machine Learning
    - Performance Analysis > Accuracy (0.46)
    - Neural Networks (0.45)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found