MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks

Mouath Abu Daoud, Chaimae Abouzahir, Leen Kharouf, Walid Al-Eisawi, Nizar Habash, Farah E. Shamout

arXiv.org Artificial Intelligence 

Large Language Models (LLMs) have demonstrated significant promise for various applications in healthcare. However, their effectiveness in the Arabic medical domain remains unexplored due to the lack of high-quality, domain-specific datasets and benchmarks. This study introduces MedArabiQ, a new benchmark dataset consisting of seven Arabic medical tasks, covering multiple specialties and including multiple-choice questions, fill-in-the-blank questions, and patient-doctor questions and answers. We first constructed the dataset using past medical exams as well as publicly available datasets. We then conducted an extensive evaluation with eight state-of-the-art open-access and proprietary high-resource LLMs, including GPT-4, DeepSeek-V3, and Gemini 1.5. Our findings highlight the need for new high-quality benchmarks spanning different languages to ensure the fair deployment and scalability of LLMs in healthcare. By establishing this benchmark and releasing the dataset, we provide a foundation for future research aimed at evaluating and enhancing the multilingual capabilities of LLMs for the equitable use of generative AI in healthcare.

Data Availability

In this article, we present a new benchmark dataset, MedArabiQ, designed to evaluate the performance of LLMs on Arabic medical tasks.
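To illustrate the kind of evaluation described above, the following is a minimal sketch of how accuracy on a multiple-choice task might be computed. The item schema (question/choices/answer fields), the placeholder questions, and the trivial baseline model are all illustrative assumptions, not the released MedArabiQ format or the paper's evaluation harness.

```python
# Hypothetical sketch of scoring a model on a multiple-choice medical
# benchmark. The dataset schema and baseline "model" are assumptions
# for illustration only, not the MedArabiQ release format.

def accuracy(dataset, model_answer):
    """Fraction of items where the model selects the correct choice label."""
    correct = 0
    for item in dataset:
        prediction = model_answer(item["question"], item["choices"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(dataset)

# Toy placeholder items (not drawn from the actual dataset).
toy_dataset = [
    {"question": "What is the normal resting adult heart rate range?",
     "choices": ["A) 60-100 bpm", "B) 20-40 bpm", "C) 150-200 bpm"],
     "answer": "A"},
    {"question": "Which organ produces insulin?",
     "choices": ["A) Liver", "B) Pancreas", "C) Kidney"],
     "answer": "B"},
]

# A trivial stand-in model that always answers "A"; a real evaluation
# would call an LLM here instead.
baseline = lambda question, choices: "A"
print(accuracy(toy_dataset, baseline))  # 0.5
```

In practice, the same loop generalizes to fill-in-the-blank and patient-doctor QA tasks by swapping the per-item comparison for a task-appropriate metric.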
