Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark
Augustyniak, Łukasz; Woźniak, Szymon; Gruza, Marcin; Gramacki, Piotr; Rajda, Krzysztof; Morzy, Mikołaj; Kajdanowicz, Tomasz
arXiv.org Artificial Intelligence
Despite impressive advances in multilingual corpus collection and model training, deploying multilingual models at scale remains a significant challenge. This is particularly true for language tasks that are culture-dependent. One example is multilingual sentiment analysis, where affective markers can be subtle and deeply embedded in culture. This work presents the most extensive open massively multilingual corpus of datasets for training sentiment models. The corpus consists of 79 datasets, manually selected according to strict quality criteria from over 350 datasets reported in the scientific literature. It covers 27 languages representing 6 language families. Datasets can be queried using several linguistic and functional features. In addition, we present a multi-faceted sentiment classification benchmark summarizing hundreds of experiments conducted on different base models, training objectives, dataset collections, and fine-tuning strategies.
Jun-13-2023