Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
Geigle, Gregor, Schneider, Florian, Holtermann, Carolin, Biemann, Chris, Timofte, Radu, Lauscher, Anne, Glavaš, Goran
arXiv.org Artificial Intelligence
Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
Jan-9-2025
- Country:
- Africa
- Asia
- Pakistan (0.04)
- Mongolia (0.04)
- Indonesia > Bali (0.04)
- Malaysia (0.04)
- Japan > Kyūshū & Okinawa
- Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
- Middle East
- Israel > Tel Aviv District
- Tel Aviv (0.04)
- Qatar > Ad-Dawhah
- Doha (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Philippines (0.04)
- Russia (0.04)
- China (0.04)
- South Korea > Seoul
- Seoul (0.04)
- Myanmar (0.04)
- Sri Lanka (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Singapore (0.04)
- India (0.04)
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Romania (0.04)
- Russia (0.04)
- Italy
- France (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Norway (0.04)
- Spain (0.04)
- Middle East > Malta
- Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Bulgaria (0.04)
- Germany > Bavaria
- Lower Franconia > Würzburg (0.04)
- Netherlands > North Holland
- Amsterdam (0.04)
- Austria > Vienna (0.14)
- Switzerland > Zürich
- Zürich (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- Dominican Republic (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- California > Los Angeles County
- Long Beach (0.04)
- Florida > Miami-Dade County
- Miami (0.14)
- Hawaii > Honolulu County
- Honolulu (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Massachusetts > Suffolk County
- Boston (0.04)
- Nevada > Clark County
- Las Vegas (0.04)
- New York > New York County
- New York City (0.04)
- Utah > Salt Lake County
- Salt Lake City (0.04)
- Oceania > Australia
- New South Wales > Sydney (0.04)
- South America
- Argentina (0.04)
- Brazil (0.04)
- Chile > Santiago Metropolitan Region
- Santiago Province > Santiago (0.04)
- Colombia (0.04)
- Ecuador (0.04)
- Peru > Cusco Department
- Cusco Province > Cusco (0.04)
- Uruguay (0.04)
- Genre:
- Research Report > New Finding (0.87)
- Industry:
- Education (0.67)
- Technology: