Goto

Collaborating Authors

 loanword


OLaPh: Optimal Language Phonemizer

arXiv.org Artificial Intelligence

Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.


Feature-Refined Unsupervised Model for Loanword Detection

arXiv.org Artificial Intelligence

We propose an unsupervised method for detecting loanwords i.e., words borrowed from one language into another. While prior work has primarily relied on language-external information to identify loanwords, such approaches can introduce circularity and constraints into the historical linguistics workflow. In contrast, our model relies solely on language-internal information to process both native and borrowed words in monolingual and multilingual wordlists. By extracting pertinent linguistic features, scoring them, and mapping them probabilistically, we iteratively refine initial results by identifying and generalizing from emerging patterns until convergence. This hybrid approach leverages both linguistic and statistical cues to guide the discovery process. We evaluate our method on the task of isolating loanwords in datasets from six standard Indo-European languages: English, German, French, Italian, Spanish, and Portuguese. Experimental results demonstrate that our model outperforms baseline methods, with strong performance gains observed when scaling to cross-linguistic data.


Statistical analysis of word flow among five Indo-European languages

arXiv.org Artificial Intelligence

A recent increase in data availability has allowed the possibility to perform different statistical linguistic studies. Here we use the Google Books Ngram dataset to analyze word flow among English, French, German, Italian, and Spanish. We study what we define as ``migrant words'', a type of loanwords that do not change their spelling. We quantify migrant words from one language to another for different decades, and notice that most migrant words can be aggregated in semantic fields and associated to historic events. We also study the statistical properties of accumulated migrant words and their rank dynamics. We propose a measure of use of migrant words that could be used as a proxy of cultural influence. Our methodology is not exempt of caveats, but our results are encouraging to promote further studies in this direction.


The Communities That Live Captioning Leaves Behind

Slate

During our synagogue's Zoom services last week, my family and I found ourselves giggling when we should have been serious. Auto-captions were turned on, and they kept botching the rabbi's Hebrew-laced English. Mourner's Kaddish (memorial prayer) was transcribed as mourner Scottish, and refua shlema (wish for a "full recovery") became with flu wash Emma. Some of the transcriptions bordered on offensive, like when Torah became terrorism and yasher koach (great job!) became wish a cough. We weren't relying on the captions and could laugh at these mistakes.


Cross-Lingual Bridges with Models of Lexical Borrowing

Journal of Artificial Intelligence Research

Linguistic borrowing is the phenomenon of transferring linguistic constructions (lexical, phonological, morphological, and syntactic) from a donor language to a recipient language as a result of contacts between communities speaking different languages. Borrowed words are found in all languages, andin contrast to cognate relationshipsborrowing relationships may exist across unrelated languages (for example, about 40% of Swahilis vocabulary is borrowed from the unrelated language Arabic). In this work, we develop a model of morpho-phonological transformations across languages. Its features are based on universal constraints from Optimality Theory (OT), and we show that compared to several standardbut linguistically more naïvebaselines, our OT-inspired model obtains good performance at predicting donor forms from borrowed forms with only a few dozen training examples, making this a cost-effective strategy for sharing lexical information across languages. We demonstrate applications of the lexical borrowing model in machine translation, using resource-rich donor language to obtain translations of out-of-vocabulary loanwords in a lower resource language. Our framework obtains substantial improvements (up to 1.6 BLEU) over standard baselines.