ERUPD -- English to Roman Urdu Parallel Dataset
Furqan, Mohammed, Khaja, Raahid Bin, Habeeb, Rayyan
arXiv.org Artificial Intelligence
Bridging linguistic gaps fosters global growth and cultural exchange. This study addresses the challenges of Roman Urdu -- a Latin-script adaptation of Urdu widely used in digital communication -- by creating a novel parallel dataset comprising 75,146 sentence pairs. Roman Urdu's lack of standardization, phonetic variability, and code-switching with English complicate language processing. We tackled these challenges with a hybrid approach that combines synthetic data generated via advanced prompt engineering with real-world conversational data from personal messaging groups. We further refined the dataset through a human evaluation phase, addressing linguistic inconsistencies and ensuring accuracy in code-switching, phonetic representations, and synonym variability. The resulting dataset captures Roman Urdu's diverse linguistic features and serves as a critical resource for machine translation, sentiment analysis, and multilingual education.
Dec-23-2024
- Country:
- Asia
- India > Telangana
- Hyderabad (0.05)
- Indonesia > Java
- Yogyakarta > Yogyakarta (0.04)
- Middle East > Oman (0.04)
- Pakistan (0.04)
- Europe
- Czechia > Prague (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- North America > United States (0.04)
- Genre:
- Overview (0.46)
- Research Report (0.65)
- Industry:
- Education (0.48)
- Technology: