serengeti
From N-grams to Pre-trained Multilingual Models For Language Identification
Sindane, Thapelo, Marivate, Vukosi
In this paper, we investigate the use of N-gram models and large pre-trained multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that selecting an effective data size remains crucial for establishing frequency distributions that model each target language efficiently, thus improving language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively multilingual pre-trained language models (PLMs): mBERT, RemBERT, and XLM-r, together with the Afri-centric multilingual models AfriBERTa, Afro-XLMr, AfroLM, and Serengeti. We further compare these models with available large-scale Language Identification tools, namely Compact Language Detector v3 (CLD V3), AfroLID, GlotLID, and OpenLID, to highlight the importance of focus-based LID. From these, we show that Serengeti is, on average, the superior model across model types, from N-grams to Transformers. Moreover, we propose a lightweight BERT-based LID model (za_BERT_lid) trained with the NHCLT + Vukzenzele corpus, which performs on par with our best-performing Afri-centric models.
- North America > Dominican Republic (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Spain (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.31)
SERENGETI: Massively Multilingual Language Models for Africa
Adebara, Ife, Elmadany, AbdelRahim, Abdul-Mageed, Muhammad, Inciarte, Alcides Alcoba
Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms other models on 11 datasets across the eight tasks, achieving 82.27 average F_1. We also perform analyses of errors from our models, which allow us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research at https://github.com/UBC-NLP/serengeti.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Africa > Niger (0.05)
- Africa > Nigeria (0.04)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- Media > News (0.92)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
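The SERENGETI abstract above reports a single "average F_1" over many datasets. As a rough illustration of how such a figure can be computed, the sketch below assumes macro-averaged F1 per dataset followed by an unweighted mean across datasets; the abstract does not state which averaging the authors used, and the function names and toy labels are hypothetical.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall (0.0 if no true positives)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)

def average_f1(datasets):
    """Mean macro-F1 across datasets, reported on a 0-100 scale."""
    scores = [macro_f1(y_true, y_pred) for y_true, y_pred in datasets.values()]
    return 100 * sum(scores) / len(scores)

# Toy gold labels and predictions for two hypothetical datasets.
toy = {
    "dataset_a": (["a", "a", "b", "b"], ["a", "b", "b", "b"]),
    "dataset_b": (["x", "y"], ["x", "y"]),
}
print(round(average_f1(toy), 2))
```

With this choice, every dataset counts equally regardless of size; a size-weighted mean would be an equally plausible reading of "average".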
Using machine learning to accelerate ecological research
Pushmeet Kohli
The Serengeti is one of the last remaining sites in the world that hosts an intact community of large mammals. These animals roam over vast swaths of land, some migrating thousands of miles across multiple countries following seasonal rainfall. As human encroachment around the park becomes more intense, these species are forced to alter their behaviours in order to survive. Increasing agriculture, poaching, and climate abnormalities contribute to changes in animal behaviours and population dynamics, but these changes have occurred at spatial and temporal scales which are difficult to monitor using traditional research methods. There is a great urgency to understand how these animal communities function as human pressures grow, both in order to understand the dynamics of these last pristine ecosystems, and to formulate effective management plans to conserve and protect the integrity of this unique biodiversity hotspot.
A.I. Became the Perfect Lab Assistant for Wildlife Conservationists
It's on us to be better caretakers for this beautiful, warming planet that we (and a few million other species) call home. Thankfully, a computer vision algorithm has learned to do, in a fraction of the time, a job that once required the help of tens of thousands of citizen wildlife scientists. The A.I. successfully labeled roughly three million images taken by Snapshot Serengeti, a project that aims to preserve biodiversity and seek out new phenomena by filling the Serengeti with unobtrusive cameras to monitor endangered species more carefully. This is all thanks to a team of computer scientists, led by Mohammad Sadegh Norouzzadeh at the University of Wyoming, who together developed an algorithm to analyze the images. Now this animal-identifying A.I., published in the journal Proceedings of the National Academy of Sciences, allows these citizen scientists to devote their time to conservation endeavors instead of spending hours sorting through photos.
Deep learning tells giraffes from gazelles in the Serengeti
Computers are playing spot the difference in the Serengeti. An image-recognition algorithm that can identify different species could make it easier to track animals in the wild. Using a database of 3.2 million photos taken by hidden camera traps in the Serengeti National Park in Tanzania, Jeff Clune at the University of Wyoming in Laramie and his colleagues trained the deep-learning system to distinguish between 48 animal species, such as elephants, giraffes and gazelles. In tests, it correctly identified the species present in an image 92 per cent of the time. Camera traps automatically take pictures of passing animals when triggered by heat and motion.
- North America > United States > Wyoming (0.26)
- Africa > Tanzania (0.26)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.06)