BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis
Alam, Sadia, Ishmam, Md Farhan, Alvee, Navid Hasin, Siddique, Md Shahnewaz, Hossain, Md Azam, Kamal, Abu Raihan Mostofa
–arXiv.org Artificial Intelligence
The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.
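As a rough illustration of the two metrics reported above, the following minimal sketch computes accuracy and an F1 score for a four-label classifier. The abstract does not name the four labels or say how F1 is averaged, so the label names (`label_a` to `label_d`), the macro averaging, and the toy predictions are all assumptions made for this example, not details of the paper's evaluation.

```python
from collections import Counter

# Hypothetical label set: the abstract states there are 4 sentiment labels
# but does not name them, so placeholders are used here.
LABELS = ["label_a", "label_b", "label_c", "label_d"]


def accuracy(gold, pred):
    """Fraction of samples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_label_f1(gold, pred, labels=LABELS):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1 (assumed averaging scheme)."""
    scores = per_label_f1(gold, pred, labels)
    return sum(scores.values()) / len(labels)


# Toy data, invented purely for illustration.
gold = ["label_a", "label_a", "label_b", "label_b", "label_c", "label_d"]
pred = ["label_a", "label_b", "label_b", "label_b", "label_c", "label_c"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

The per-label scores returned by `per_label_f1` are the kind of breakdown that would expose the performance variation across sentiment labels mentioned in the abstract.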
Aug-16-2024
- Country:
  - North America > United States
    - Washington > King County > Seattle (0.04)
    - New York > New York County > New York City (0.04)
    - New Mexico > Santa Fe County > Santa Fe (0.04)
    - Minnesota > Hennepin County > Minneapolis (0.14)
  - Europe
    - Sweden > Östergötland County > Linköping (0.04)
    - Spain > Valencian Community > Valencia Province > Valencia (0.04)
    - France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
    - Finland > Southwest Finland > Turku (0.04)
  - Asia
    - Middle East > Iran (0.04)
    - Indonesia > Bali (0.04)
    - China > Hong Kong (0.04)
    - Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
    - Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
- Genre:
  - Research Report (0.65)
- Industry:
  - Information Technology (1.00)
- Technology: