ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations
arXiv.org Artificial Intelligence
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
R. Al-Sabbagh / Data in Brief 54 (2024) 110271

Subject: Computer Science, Social Sciences
Specific subject area: Natural Language Processing, machine translation, large language models, translation studies, cross-linguistic analysis, lexical semantics
Data format: Translated and aligned
Type of data: Texts (bilingual tables in Microsoft Excel files)
Data collection: The ArzEn-MultiGenre dataset consists of three genres: song lyrics, novels, and subtitles. The data was gathered from various sources using different methods. A website was crawled for song lyrics using an in-house web crawler, and professional translators manually translated the lyrics into English. For novels, hard copies were collected in English and Egyptian Arabic, then scanned and converted into text files using an Optical Character Recognizer (OCR). The OCR output was then manually reviewed and aligned.
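Since the data is described as aligned bilingual tables, a reader working with an export of those tables may want a basic sanity check that every Egyptian Arabic segment has an English counterpart. The following is a minimal sketch under stated assumptions, not the author's tooling: the three-field row layout (genre, Egyptian Arabic, English), the genre labels, the `AlignedPair` and `check_alignment` names, and the sample rows are all hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    genre: str  # hypothetical label: "lyrics", "novel", or "subtitle"
    arz: str    # Egyptian Arabic segment
    en: str     # English translation


def check_alignment(rows):
    """Return (well-formed pairs, indices of rows with an empty side)."""
    pairs, problems = [], []
    for i, (genre, arz, en) in enumerate(rows):
        arz, en = arz.strip(), en.strip()
        if arz and en:
            pairs.append(AlignedPair(genre, arz, en))
        else:
            # One side is missing, so the row cannot count as aligned.
            problems.append(i)
    return pairs, problems


# Made-up illustrative rows, not drawn from ArzEn-MultiGenre.
rows = [
    ("lyrics", "صباح الخير", "Good morning"),
    ("subtitle", "شكرا", ""),
    ("novel", "  مع السلامة ", "Goodbye"),
]
pairs, problems = check_alignment(rows)
```

In practice the rows would come from whatever reader is used to load the Excel files; the check only assumes each row yields three strings.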
Aug-5-2025
- Country:
- Africa > Middle East
- Egypt (0.05)
- Asia > Middle East
- UAE > Sharjah Emirate > Sharjah (0.04)
- Genre:
- Research Report (0.64)
- Industry:
- Leisure & Entertainment (1.00)
- Media
- Music (0.57)
- Television (0.48)
- Technology: