Benchmarking Arabic AI with Large Language Models

Ahmed Abdelali, Hamdy Mubarak, Shammur Absar Chowdhury, Maram Hasanain, Basel Mousi, Sabri Boughorbel, Yassine El Kheir, Daniel Izham, Fahim Dalvi, Majd Hawasly, Nizi Nazar, Yousseif Elshahawy, Ahmed Ali, Nadir Durrani, Natasa Milic-Frayling, Firoj Alam

arXiv.org Artificial Intelligence 

With large Foundation Models (FMs), language technologies (AI in general) are entering a new paradigm: eliminating the need to develop large-scale task-specific datasets and supporting a variety of tasks through setups ranging from zero-shot to few-shot learning. However, understanding FMs' capabilities requires a systematic benchmarking effort that compares FM performance with state-of-the-art (SOTA) task-specific models. With that goal, past work has focused on the English language, with only a few efforts covering multiple languages. Our study contributes to ongoing research by evaluating FM performance on standard Arabic NLP and speech processing, including a range of tasks from sequence tagging to content classification across diverse domains. We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM, addressing 33 unique tasks using 59 publicly available datasets, resulting in 96 test setups. For a few tasks, the FMs perform on par with or exceed the SOTA models, but for the majority they under-perform. Given the importance of the prompt for FM performance, we discuss our prompting strategies in detail and elaborate on our findings. Our future work on Arabic AI will explore few-shot prompting, expand the range of tasks, and investigate additional open-source models.
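To illustrate the zero-shot setup the abstract describes, below is a minimal sketch of how a zero-shot classification prompt might be assembled for one Arabic task. The task wording, label set, and helper function are illustrative assumptions, not the paper's actual prompts; the API call to a model such as gpt-3.5-turbo is deliberately omitted.

```python
def build_zero_shot_prompt(text: str, labels: list[str]) -> str:
    """Assemble a zero-shot classification prompt: a task instruction,
    the allowed labels, and the input text -- with no labeled examples,
    which is what distinguishes zero-shot from few-shot prompting."""
    label_str = ", ".join(labels)
    return (
        "Classify the sentiment of the following Arabic text.\n"
        f"Answer with exactly one of: {label_str}.\n\n"
        f"Text: {text}\n"
        "Label:"
    )

# Hypothetical example input; the resulting string would be sent as the
# user message in a chat-completion request to the model under test.
prompt = build_zero_shot_prompt("أحب هذا الفيلم", ["positive", "negative", "neutral"])
print(prompt)
```

In a benchmarking loop, one such prompt template per task (33 tasks, 96 test setups in the paper) would be filled with each test example, and the model's returned label compared against the gold label to compute task metrics.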
