Benchmarking Arabic AI with Large Language Models
Ahmed Abdelali, Hamdy Mubarak, Shammur Absar Chowdhury, Maram Hasanain, Basel Mousi, Sabri Boughorbel, Yassine El Kheir, Daniel Izham, Fahim Dalvi, Majd Hawasly, Nizi Nazar, Yousseif Elshahawy, Ahmed Ali, Nadir Durrani, Natasa Milic-Frayling, Firoj Alam
With large Foundation Models (FMs), language technologies (and AI in general) are entering a new paradigm: eliminating the need to develop large-scale task-specific datasets and supporting a variety of tasks through setups ranging from zero-shot to few-shot learning. However, understanding FMs' capabilities requires systematic benchmarking that compares FM performance with state-of-the-art (SOTA) task-specific models. With that goal, past work has focused on the English language, with only a few efforts covering multiple languages. Our study contributes to this ongoing research by evaluating FM performance on standard Arabic NLP and speech-processing tasks, ranging from sequence tagging to content classification across diverse domains. We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM, addressing 33 unique tasks using 59 publicly available datasets, resulting in 96 test setups. For a few tasks, the FMs perform on par with or exceed the SOTA models, but for the majority they under-perform. Given the importance of prompts for FM performance, we discuss our prompting strategies in detail and elaborate on our findings. Our future work on Arabic AI will explore few-shot prompting, expand the range of tasks, and investigate additional open-source models.
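The zero-shot evaluation described in the abstract amounts to sending each test instance to the model with a task-specific prompt (no in-context examples) and scoring the response against the gold label. As a rough illustrative sketch only, not the paper's actual benchmarking harness, the snippet below shows one such loop for an Arabic sentiment-classification task, assuming the OpenAI Python SDK (v1+); the prompt wording, label set, sample data, and accuracy metric are assumptions for illustration.

```python
# Minimal zero-shot evaluation sketch for one Arabic NLP task (sentiment
# classification). Assumes the OpenAI Python SDK (v1+) and an illustrative
# dataset of (text, gold_label) pairs; not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "Classify the sentiment of the following Arabic sentence as "
    "'positive', 'negative', or 'neutral'. Return only the label.\n\n"
    "Sentence: {text}"
)

def zero_shot_label(text: str) -> str:
    """Ask GPT-3.5-turbo for a label with no in-context examples (zero-shot)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(text=text)}],
        temperature=0,  # deterministic decoding aids benchmarking reproducibility
    )
    return response.choices[0].message.content.strip().lower()

def evaluate(dataset: list[tuple[str, str]]) -> float:
    """Compare model predictions with gold labels and return accuracy."""
    correct = sum(zero_shot_label(text) == gold for text, gold in dataset)
    return correct / len(dataset)

if __name__ == "__main__":
    # Tiny illustrative sample; the paper evaluates 59 public datasets (96 test setups).
    sample = [("الخدمة ممتازة جدا", "positive"), ("التجربة كانت سيئة", "negative")]
    print(f"Zero-shot accuracy: {evaluate(sample):.2f}")
```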
arXiv.org Artificial Intelligence
May-24-2023
- Country:
- Africa > Middle East (1.00)
- Asia > Middle East
- Iraq (0.28)
- Europe (1.00)
- North America (1.00)
- Genre:
- Research Report > New Finding (0.66)
- Industry:
- Government (1.00)
- Health & Medicine (1.00)
- Information Technology (0.67)
- Media > News (1.00)
- Technology: