Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities
Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel
arXiv.org Artificial Intelligence
Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English. Adapting a pre-trained model to a new language involves specialized techniques that differ significantly from training a model from scratch or further training existing models on well-resourced languages such as English. We outline these novel training methodologies, which facilitate effective learning and adaptation to the linguistic properties of Hebrew. Additionally, we fine-tuned DictaLM2.0-Instruct on a comprehensive instruct dataset to enhance its performance on task-specific instructions. To rigorously evaluate our models, we introduce a new benchmark suite for Hebrew LLM evaluation, covering a diverse set of tasks including Question Answering, Sentiment Analysis, Winograd Schema Challenge, Translation, and Summarization. Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.
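The abstract's "enhanced vocabulary" refers to extending a pre-trained model's tokenizer with language-specific tokens, which in turn requires growing the embedding matrix. The paper does not specify its initialization scheme here, so the sketch below uses a common, generic heuristic (new rows initialized near the mean of existing embeddings) with hypothetical sizes; the function name and all numbers are illustrative assumptions, not the authors' method.

```python
import numpy as np

def extend_embeddings(emb: np.ndarray, n_new: int, rng=None) -> np.ndarray:
    """Append rows for newly added vocabulary items.

    New rows start at the mean of the existing embeddings plus small
    noise -- a widely used heuristic when growing a tokenizer's
    vocabulary, so new tokens begin near the center of embedding space
    rather than at random positions far from trained tokens.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    mean = emb.mean(axis=0, keepdims=True)                    # shape (1, d)
    noise = rng.normal(scale=0.02, size=(n_new, emb.shape[1]))
    return np.vstack([emb, mean + noise])                     # (V + n_new, d)

# Hypothetical sizes: 32k base vocabulary, 1k added Hebrew tokens, dim 64.
base = np.random.default_rng(1).normal(size=(32000, 64))
extended = extend_embeddings(base, 1000)
print(extended.shape)  # (33000, 64)
```

In practice the same resize must be applied to both the input embedding matrix and the output (LM head) projection before continued pre-training on the new language.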
Jul-9-2024