pidgin
Bridging the Gap with Retrieval-Augmented Generation: Making Prosthetic Device User Manuals Available in Marginalised Languages
Ogbonna, Ikechukwu, Davidson, Lesley, Banerjee, Soumya, Dasgupta, Abhishek, Kenney, Laurence, Nagaraja, Vikranth Harthikote
Millions of people in African countries face barriers to accessing healthcare due to language and literacy gaps. This research tackles this challenge by transforming complex medical documents, in this case prosthetic device user manuals, into accessible formats for underserved populations. This case study in cross-cultural translation is particularly relevant for communities that receive donated prosthetic devices but may not receive the accompanying user documentation, or for whom documentation that is available online exists only in formats (e.g., language and readability) that are inaccessible to local populations (e.g., English-language text written for high-resource settings and cultural contexts). The approach is demonstrated using the widely spoken Pidgin dialect, but our open-source framework has been designed to enable rapid and easy extension to other languages and dialects. This work presents an AI-powered framework designed to process and translate complex medical documents, e.g., user manuals for prosthetic devices, into marginalised languages. The system enables users, such as healthcare workers or patients, to upload English-language medical equipment manuals, pose questions in their native language, and receive accurate, localised answers in real time. Technically, the system integrates a Retrieval-Augmented Generation (RAG) pipeline for processing and semantic understanding of the uploaded manuals. It then employs advanced Natural Language Processing (NLP) models for generative question-answering and multilingual translation. Beyond simple translation, it ensures accessibility to device instructions, treatment protocols, and safety information, empowering patients and clinicians to make informed healthcare decisions.
- Africa > Nigeria (0.06)
- North America > United States (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (5 more...)
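The retrieval step of a RAG pipeline like the one described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which retriever or embedding model is used, so a toy bag-of-words cosine similarity stands in for it. The manual text, the function names (`retrieve`, `build_prompt`), and the assumption that the user's question has already been rendered into English before retrieval are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k manual chunks most similar to the question.

    Assumption: a toy term-overlap score replaces the (unspecified)
    semantic retriever, and the question is already in English.
    """
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, passages: list[str], target_language: str) -> str:
    """Assemble a generation prompt that grounds the answer in the passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Answer in {target_language} using only the manual excerpts below.\n"
        f"Excerpts:\n{context}\n"
        f"Question: {question}\n"
    )


# Hypothetical manual excerpts, invented for illustration only.
MANUAL_CHUNKS = [
    "Clean the socket liner daily with mild soap and water.",
    "Charge the battery for four hours before first use.",
    "Contact your clinician if the socket causes skin irritation.",
]

question = "How long should I charge the battery?"
top = retrieve(question, MANUAL_CHUNKS, k=1)
prompt = build_prompt(question, top, "Nigerian Pidgin")
```

The resulting `prompt` would then be passed to whichever generative model the system uses. A real deployment would likely swap the term-overlap score for multilingual embeddings so that questions asked directly in Pidgin can be matched against English manual text.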
Smotrom tvoja pa ander drogoj verden! Resurrecting Dead Pidgin with Generative Models: Russenorsk Case Study
Tikhonov, Alexey, Shteiner, Sergei, Bykova, Anna, Yamshchikov, Ivan P.
Russenorsk, a pidgin language historically used in trade interactions between Russian and Norwegian speakers, represents a unique linguistic phenomenon. In this paper, we attempt to analyze its lexicon using modern large language models (LLMs), based on surviving literary sources. We construct a structured dictionary of the language, grouped by synonyms and word origins. Subsequently, we use this dictionary to formulate hypotheses about the core principles of word formation and grammatical structure in Russenorsk and show which hypotheses generated by large language models correspond to those previously proposed in the academic literature. We also develop a "reconstruction" translation agent that generates hypothetical Russenorsk renderings of contemporary Russian and Norwegian texts.
- Europe > Finland > Uusimaa > Helsinki (0.05)
- Europe > Netherlands > South Holland > Leiden (0.04)
- Europe > Germany > Berlin (0.04)
- (7 more...)
Recursive Semantic Anchoring in ISO 639:2023: A Structural Extension to ISO/TC 37 Frameworks
ISO 639:2023 unifies the ISO language-code family and introduces contextual metadata, but it lacks a machine-native mechanism for handling dialectal drift and creole mixtures. We propose a formalisation of recursive semantic anchoring, attaching to every language entity $χ$ a family of fixed-point operators $ϕ_{n,m}$ that model bounded semantic drift via the relation $ϕ_{n,m}(χ) = χ\oplus Δ(χ)$, where $Δ(χ)$ is a drift vector in a latent semantic manifold. The base anchor $ϕ_{0,0}$ recovers the canonical ISO 639:2023 identity, whereas $ϕ_{99,9}$ marks the maximal drift state that triggers a deterministic fallback. Using category theory, we treat the operators $ϕ_{n,m}$ as morphisms and drift vectors as arrows in a category $\mathrm{DriftLang}$. A functor $Φ: \mathrm{DriftLang} \to \mathrm{AnchorLang}$ maps every drifted object to its unique anchor and proves convergence. We provide an RDF/Turtle schema (\texttt{BaseLanguage}, \texttt{DriftedLanguage}, \texttt{ResolvedAnchor}) and worked examples -- e.g., $ϕ_{8,4}$ (Standard Mandarin) versus $ϕ_{8,7}$ (a colloquial variant), and $ϕ_{1,7}$ for Nigerian Pidgin anchored to English. Experiments with transformer models show higher accuracy in language identification and translation on noisy or code-switched input when the $ϕ$-indices are used to guide fallback routing. The framework is compatible with ISO/TC 37 and provides an AI-tractable, drift-aware semantic layer for future standards.
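One way to picture the fallback routing mentioned in this abstract is the small sketch below. It is an illustration under stated assumptions, not the paper's schema or algorithm: the abstract does not define what the indices $n$ and $m$ encode, so they are treated as opaque labels. The class name reuses `DriftedLanguage` from the abstract's schema list, but its fields, the `resolve` function, the ISO 639-3 codes chosen for the examples, and the use of `und` (undetermined) as the fallback target are all assumptions.

```python
from dataclasses import dataclass

# Index values taken from the abstract; their internal meaning is not specified there.
BASE_ANCHOR = (0, 0)   # phi_{0,0}: canonical ISO 639:2023 identity
MAX_DRIFT = (99, 9)    # phi_{99,9}: maximal drift, triggers the fallback


@dataclass(frozen=True)
class DriftedLanguage:
    """A language variety tagged with a drift index and the code it anchors to.

    Field names are hypothetical; the paper's RDF/Turtle schema may differ.
    """
    label: str
    index: tuple[int, int]
    anchor: str  # assumed to be an ISO 639-3 code


def resolve(entity: DriftedLanguage, fallback: str = "und") -> str:
    """Map a drifted entity to the code used for downstream routing.

    Assumption: maximal drift routes to a fixed fallback code; every other
    index (including the base anchor) resolves to the entity's own anchor.
    """
    if entity.index == MAX_DRIFT:
        return fallback
    return entity.anchor


# Example from the abstract: Nigerian Pidgin at phi_{1,7}, anchored to English.
pidgin = DriftedLanguage("Nigerian Pidgin", (1, 7), "eng")
# Base anchor example: English itself at phi_{0,0}.
english = DriftedLanguage("English", BASE_ANCHOR, "eng")
# Hypothetical input whose drift is maximal.
unknown_mix = DriftedLanguage("unidentified mixture", MAX_DRIFT, "eng")

routes = [resolve(e) for e in (pidgin, english, unknown_mix)]
```

Here `routes` evaluates to `["eng", "eng", "und"]`: the drifted and base entities both route to their anchor, and only the maximally drifted input takes the fallback path.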
Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin
Lin, Pin-Jie, Saeed, Muhammed, Chang, Ernie, Scholman, Merel
Developing effective spoken language processing systems for low-resource languages poses several challenges due to the lack of parallel data and limited resources for fine-tuning models. In this work, we aim to improve both text classification and translation of Nigerian Pidgin (Naija) by collecting a large-scale parallel English-Pidgin corpus and by proposing a framework of cross-lingual adaptive training that includes both continual and task adaptive training so as to adapt a base pre-trained model to low-resource languages. Our studies show that English pre-trained language models serve as a stronger prior than multilingual language models on English-Pidgin tasks, with improvements of up to 2.38 BLEU, and demonstrate that augmenting orthographic data and using task adaptive training with back-translation can have a significant impact on model performance.
- Europe > Germany > Saarland (0.05)
- Africa > Nigeria (0.04)
- North America > Dominican Republic (0.04)
- (3 more...)
Pidgin - West African lingua franca
The BBC is launching 11 new language services and one of them is English-based Pidgin, which is one of the most widely spoken languages across West Africa, even though it is not officially recognised. The Oxford English Dictionary definition of Pidgin is: A language containing lexical and other features from two or more languages, characteristically with simplified grammar and a smaller vocabulary than the languages from which it is derived, used for communication between people not having a common language; a lingua franca. Simply put, Pidgin English is a mixture of English and local languages which enables people who do not share a common language to communicate. Most African countries are made up of numerous different ethnic groups who do not necessarily have a lingua franca, so Pidgin has developed. It is widely spoken in Nigeria, Ghana, Equatorial Guinea and Cameroon.
- Africa > Nigeria (0.32)
- Africa > West Africa (0.26)
- Africa > Ghana (0.26)
- (3 more...)