ERUPD -- English to Roman Urdu Parallel Dataset

Furqan, Mohammed; Khaja, Raahid Bin; Habeeb, Rayyan

Dec-23-2024–arXiv.org Artificial Intelligence

Bridging linguistic gaps fosters global growth and cultural exchange. This study addresses the challenges of processing Roman Urdu -- a Latin-script adaptation of Urdu widely used in digital communication -- by creating a novel parallel dataset of 75,146 English-Roman Urdu sentence pairs. Roman Urdu's lack of standardization, phonetic variability, and code-switching with English complicate language processing. We address these issues with a hybrid approach that combines synthetic data generated via advanced prompt engineering with real-world conversational data from personal messaging groups. We then refined the dataset through a human evaluation phase that resolved linguistic inconsistencies and ensured accuracy in code-switching, phonetic representations, and synonym variability. The resulting dataset captures Roman Urdu's diverse linguistic features and serves as a critical resource for machine translation, sentiment analysis, and multilingual education.
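
The hybrid construction described above, in which synthetic and real-world sentence pairs are pooled into one parallel corpus, can be illustrated with a minimal sketch. This is not the authors' pipeline: the `SentencePair` type, the `source` labels, and the `normalize` and `merge_pairs` helpers are hypothetical names chosen for illustration, and the de-duplication rule (case and whitespace only) is an assumption. Pairs that differ only in Roman Urdu spelling are deliberately kept, since that variability is one of the features the dataset is meant to capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One English / Roman Urdu pair (hypothetical record layout)."""
    english: str
    roman_urdu: str
    source: str  # "synthetic" or "real"; labels are assumptions


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def merge_pairs(synthetic, real):
    """Pool both sources, keeping the first occurrence of each pair.

    Real-world pairs are scanned first, so they win over synthetic duplicates.
    """
    seen = set()
    merged = []
    for pair in list(real) + list(synthetic):
        key = (normalize(pair.english), normalize(pair.roman_urdu))
        if key in seen:
            continue
        seen.add(key)
        merged.append(pair)
    return merged


# Illustrative data only; these sentences are not taken from the dataset.
synthetic = [
    SentencePair("How are you?", "Aap kaise hain?", "synthetic"),
    SentencePair("Thank you", "Shukriya", "synthetic"),
]
real = [
    SentencePair("how are you?", "aap kaise hain?", "real"),
    SentencePair("How are you?", "Aap kese hain?", "real"),  # spelling variant
]
merged = merge_pairs(synthetic, real)
```

In this sketch the synthetic "Aap kaise hain?" pair is dropped as a duplicate of the real one, while the "kese" spelling survives as a separate entry. A human evaluation pass such as the one the abstract mentions would then review the merged pairs.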

large language model, machine learning, natural language

Country:
- North America > United States (0.04)
- Europe
  - Czechia > Prague (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
- Asia
  - Middle East > Oman (0.04)
  - Pakistan (0.04)
  - Indonesia > Java
    - Yogyakarta > Yogyakarta (0.04)
  - India > Telangana
    - Hyderabad (0.05)

Genre:
- Research Report (0.65)
- Overview (0.46)

Industry:
- Education (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Machine Translation (1.00)
    - Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)
