One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks

Nehrdich, Sebastian, Hellwig, Oliver, Keutzer, Kurt

Sep-20-2024–arXiv.org Artificial Intelligence

Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich language Sanskrit. We evaluate ByT5-Sanskrit on established Sanskrit word segmentation tasks, where it outperforms previous data-driven approaches by a considerable margin and matches the performance of the current best lexicon-based model. It is easier to deploy and more robust to data not covered by external linguistic resources. It also achieves new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction tasks. Additionally, based on the Digital Corpus of Sanskrit, we introduce a novel multitask dataset for the joint training of Sanskrit word segmentation, lemmatization, and morphosyntactic tagging tasks. We fine-tune ByT5-Sanskrit on this dataset, creating a versatile multitask model for various downstream Sanskrit applications. We have used this model in Sanskrit linguistic annotation projects, in information retrieval setups, and as a preprocessing step in a Sanskrit machine translation pipeline. We also show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages. We thus demonstrate that byte-level pretrained language models can achieve excellent performance for morphologically rich languages, outperforming tokenizer-based models and presenting an important vector of exploration when constructing NLP pipelines for such languages.

dataset, dependency, sanskrit, (16 more...)

arXiv.org Artificial Intelligence

Sep-20-2024

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - California > Alameda County > Berkeley (0.04)
- Europe
  - Slovenia (0.04)
  - United Kingdom > England
    - Oxfordshire > Oxford (0.14)
  - Switzerland > Zürich
    - Zürich (0.04)
  - Germany
    - Lower Saxony > Gottingen (0.04)
    - North Rhine-Westphalia > Düsseldorf Region
      - Düsseldorf (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found