Common Ground, Diverse Roots: The Difficulty of Classifying Common Examples in Spanish Varieties
Lopetegui, Javier A., Riabi, Arij, Seddah, Djamé
arXiv.org Artificial Intelligence
Language variation across geographic regions and cultures must be addressed to avoid biases in NLP systems designed for culturally sensitive tasks, such as hate speech detection or dialog with conversational agents. In languages such as Spanish, where varieties can overlap significantly, many examples are valid in several varieties at once; we refer to these as common examples. Ignoring them may cause misclassifications, reducing model accuracy and fairness. Accounting for common examples is therefore essential to improve the robustness and representativeness of NLP systems trained on such data. In this work, we address this problem in the context of Spanish varieties. We use training dynamics to automatically detect common examples or errors in existing Spanish datasets. We demonstrate that using predicted label confidence in our Datamaps (Swayamdipta et al., 2020) implementation is effective for identifying hard-to-classify examples, especially common examples, and that this improves model performance in variety identification tasks. Additionally, we introduce a Cuban Spanish Variety Identification dataset with common-example annotations, developed to facilitate more accurate detection of Cuban and Caribbean Spanish varieties. To our knowledge, this is the first dataset focused on identifying the Cuban, or any other Caribbean, Spanish variety.
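The training-dynamics idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that per-epoch class probabilities have been saved for every training example, computes the Datamaps statistics (mean and standard deviation of the gold-label probability across epochs), and adds a predicted-label confidence, interpreted here as the mean of the highest class probability per epoch. That interpretation, the function names, the toy probabilities, and the 0.6 threshold are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def training_dynamics(epoch_probs, gold_labels):
    """Compute per-example statistics from saved training dynamics.

    epoch_probs[e][i] is the list of class probabilities that the model
    assigned to example i at the end of epoch e (hypothetical layout).
    """
    n_epochs = len(epoch_probs)
    stats = []
    for i, gold in enumerate(gold_labels):
        # Probability of the gold label at each epoch (Datamaps-style).
        gold_p = [epoch_probs[e][i][gold] for e in range(n_epochs)]
        # Probability of whichever label the model predicted at each epoch
        # (assumed reading of "predicted label confidence").
        pred_p = [max(epoch_probs[e][i]) for e in range(n_epochs)]
        stats.append({
            "gold_confidence": mean(gold_p),
            "variability": pstdev(gold_p),
            "predicted_confidence": mean(pred_p),
        })
    return stats


def flag_candidates(stats, threshold=0.6):
    """Return indices whose predicted-label confidence stays below threshold.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    return [i for i, s in enumerate(stats)
            if s["predicted_confidence"] < threshold]


# Toy data: 3 epochs, 2 examples, 2 classes (0 = Cuban, 1 = other variety).
epoch_probs = [
    [[0.90, 0.10], [0.55, 0.45]],
    [[0.95, 0.05], [0.45, 0.55]],
    [[0.97, 0.03], [0.50, 0.50]],
]
gold_labels = [0, 0]

stats = training_dynamics(epoch_probs, gold_labels)
candidates = flag_candidates(stats)
print(candidates)
```

In this toy run the model never commits to a label for the second example, so it is the one flagged as a candidate common example or annotation error. In practice such candidates would still need manual review.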
Dec-16-2024
- Country:
  - Asia
    - India > Maharashtra
      - Mumbai (0.04)
    - Middle East
      - Republic of Türkiye > Istanbul Province
        - Istanbul (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.04)
    - Singapore (0.04)
    - Thailand > Bangkok
      - Bangkok (0.04)
  - Europe
    - Croatia > Dubrovnik-Neretva County
      - Dubrovnik (0.04)
    - France
      - Provence-Alpes-Côte d'Azur (0.04)
      - Île-de-France > Paris
        - Paris (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Middle East > Republic of Türkiye
      - Istanbul Province > Istanbul (0.04)
    - Slovenia (0.04)
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.04)
  - North America
    - Costa Rica (0.04)
    - Cuba (0.06)
    - El Salvador (0.04)
    - Mexico > Mexico City
      - Mexico City (0.04)
    - Panama (0.04)
    - United States > Michigan
      - Washtenaw County > Ann Arbor (0.04)
  - Oceania > Australia (0.04)
  - South America
- Genre:
  - Research Report (0.82)
- Technology: