Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM

Areej AlOtaibi, Lina Alyahya, Raghad Alshabanah, Shahad Alfawzan, Shuruq Alarefei, Reem Alsabti, Nouf Alsubaie, Abdulaziz Alhuzaymi, Lujain Alkhelb, Majd Alsayari, Waad Alahmed, Omar Talabay, Jalal Alowibdi, Salem Alelyani, Adel Bibi

arXiv.org (Artificial Intelligence)

Large Language Models (LLMs) have significantly advanced the field of natural language processing, enhancing capabilities in both language understanding and generation across diverse domains. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation. We detail our approach to collecting and filtering Arabic pre-training datasets, assess the impact of various tokenizer designs on model performance, and examine the limitations of existing Arabic evaluation frameworks, for which we propose a systematic corrective methodology. To promote transparency and facilitate collaborative development, we share our data and methodologies, contributing to the advancement of language modeling, particularly for Arabic.
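The abstract does not spell out the filtering pipeline, but the kind of heuristic document filtering it alludes to is commonly built from script-ratio checks, length thresholds, and exact deduplication. The sketch below is purely illustrative, not the authors' method; the thresholds, function names, and the restriction to the basic Arabic Unicode block are all assumptions.

```python
import hashlib
import re

# Basic Arabic Unicode block; a simplification for illustration (ignores
# presentation forms and supplements).
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")

def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Arabic-script."""
    stripped = re.sub(r"\s", "", text)
    if not stripped:
        return 0.0
    return len(ARABIC_CHARS.findall(stripped)) / len(stripped)

def keep_document(text: str, min_ratio: float = 0.7, min_chars: int = 200) -> bool:
    """Heuristic quality gate: document is long enough and mostly Arabic.

    The 0.7 ratio and 200-character floor are assumed values, not figures
    from the paper.
    """
    return len(text) >= min_chars and arabic_ratio(text) >= min_ratio

def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates by hashing whitespace-stripped text."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```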
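Similarly, the tokenizer-design comparison the abstract mentions typically starts from training a subword model on the curated corpus. The snippet below is a minimal sketch using the SentencePiece library; the file paths, BPE choice, vocabulary size, and character coverage are assumptions for illustration, not the configuration evaluated in the paper.

```python
import sentencepiece as spm

# Train a BPE tokenizer on a plain-text Arabic corpus (one document per line).
# "arabic_corpus.txt" is a hypothetical path; vocab_size and character_coverage
# are assumed values, not the paper's settings.
spm.SentencePieceTrainer.train(
    input="arabic_corpus.txt",
    model_prefix="ar_bpe",
    model_type="bpe",
    vocab_size=32000,
    character_coverage=0.9995,  # retain rare Arabic characters
)

# Inspect how the learned vocabulary segments Arabic text.
sp = spm.SentencePieceProcessor(model_file="ar_bpe.model")
print(sp.encode("النماذج اللغوية الكبيرة", out_type=str))
```

Comparing such models usually comes down to fertility (tokens per word) and coverage on held-out Arabic text, which is one way the impact of tokenizer design on downstream performance can be measured.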
