L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models
Pingle, Aabha, Vyawahare, Aditya, Joshi, Isha, Tangsali, Rahul, Joshi, Raviraj
arXiv.org Artificial Intelligence
Sentiment analysis in low-resource languages such as Marathi has seen limited exploration because suitable datasets are scarce. In this work, we present L3Cube-MahaSent-MD, a multi-domain Marathi sentiment analysis dataset covering four domains: movie reviews, general tweets, TV show subtitles, and political tweets. The dataset consists of around 60,000 manually tagged samples, each labeled with one of three sentiments: positive, negative, or neutral. Each domain forms a sub-dataset of 15k samples. MahaSent-MD is the first comprehensive multi-domain sentiment analysis dataset within the Indic sentiment landscape. We fine-tune different monolingual and multilingual BERT models on these datasets and report the best accuracy with the MahaBERT model. We also present an extensive in-domain and cross-domain analysis, highlighting the need for low-resource multi-domain datasets. The data and models are available at https://github.com/l3cube-pune/MarathiNLP.
Jun-24-2023
- Country:
  - North America > United States
    - Oregon > Multnomah County
      - Portland (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Hawaii > Honolulu County
      - Honolulu (0.04)
    - California > Santa Clara County
      - Stanford (0.04)
  - Europe
    - Czechia > Prague (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
  - Asia
    - Middle East > UAE
      - Dubai Emirate > Dubai (0.04)
    - India
      - Maharashtra > Pune (0.04)
      - Tamil Nadu > Chennai (0.04)
- Genre:
  - Research Report (0.64)
- Industry:
  - Leisure & Entertainment (1.00)
  - Media > Film (0.93)
- Technology: