CharSS: Character-Level Transformer Model for Sanskrit Word Segmentation

Bhatt, Krishnakant, J, Karthika N, Ramakrishnan, Ganesh, Jyothi, Preethi

Jul-8-2024–arXiv.org Artificial Intelligence

Subword tokens in Indian languages inherently carry meaning, and isolating them can enhance NLP tasks, making sub-word segmentation a crucial process. Segmenting Sanskrit and other Indian languages into subtokens is not straightforward, as it may include sandhi, which may lead to changes in the word boundaries. We propose a new approach of utilizing a Character-level Transformer model for Sanskrit Word Segmentation (CharSS). We perform experiments on three benchmark datasets to compare the performance of our method against existing methods. On the UoH+SandhiKosh dataset, our method outperforms the current state-of-the-art system by an absolute gain of 6.72 points in split prediction accuracy. On the hackathon dataset, our method achieves a gain of 2.27 points over the current SOTA system in terms of perfect match metric. We also propose a use-case of Sanskrit-based segments for a linguistically informed translation of technical terms to lexically similar low-resource Indian languages. In two separate experimental settings for this task, we achieve an average improvement of 8.46 and 6.79 chrF++ scores, respectively.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

Jul-8-2024

arXiv.org PDF

Add feedback

Country:
- Oceania > Australia
  - Australian Capital Territory > Canberra (0.04)
- Europe
  - Finland > Uusimaa
    - Helsinki (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia > India
  - NCT
    - New Delhi (0.04)
    - Delhi (0.04)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Grammars & Parsing (0.82)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found