Sāmayik: A Benchmark and Dataset for English-Sanskrit Translation

Maheshwari, Ayush, Gupta, Ashim, Krishna, Amrith, Ramakrishnan, Ganesh, Kumar, G. Anil, Singla, Jitin

May-23-2023–arXiv.org Artificial Intelligence

Sanskrit is a low-resource language with a rich heritage. Digitized Sanskrit corpora that reflect contemporary usage of the language, particularly in prose, are heavily under-represented, and no such English-Sanskrit parallel dataset is currently publicly available. To bridge this gap, we release Sāmayik, a dataset of more than 42,000 parallel English-Sanskrit sentences drawn from four different corpora. We also release benchmarks adapted from existing multilingual pretrained models for Sanskrit-English translation. The training data combine the training splits of our contemporary dataset with the Sanskrit-English parallel sentences from the training split of Itihāsa, a previously released classical-era machine translation dataset containing Sanskrit.
