Benchmarking Sociolinguistic Diversity in Swahili NLP: A Taxonomy-Guided Approach
Oketch, Kezia, Lalor, John P., Abbasi, Ahmed
arXiv.org Artificial Intelligence
We introduce the first taxonomy-guided evaluation of Swahili NLP, addressing gaps in sociolinguistic diversity. Drawing on health-related psychometric tasks, we collect a dataset of 2,170 free-text responses from Kenyan speakers. The data exhibits tribal influences, urban vernacular, code-mixing, and loanwords. We develop a structured taxonomy and use it as a lens for examining model prediction errors across pre-trained and instruction-tuned language models. Our findings advance culturally grounded evaluation frameworks and highlight the role of sociolinguistic variation in shaping model performance.
Aug-21-2025
- Country:
- Africa
- Central Africa (0.04)
- Democratic Republic of the Congo (0.04)
- East Africa (0.04)
- Kenya (0.04)
- Tanzania (0.04)
- Uganda (0.04)
- Asia
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Spain (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- North America
- Dominican Republic (0.04)
- United States
- Florida > Miami-Dade County
- Miami (0.04)
- Illinois > Cook County
- Chicago (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Texas > Travis County
- Austin (0.04)
- Washington > King County
- Seattle (0.04)
- Genre:
- Research Report
- Experimental Study (0.93)
- New Finding (1.00)
- Industry:
- Health & Medicine (1.00)
- Technology: