cognate
Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search
We propose an unsupervised method for the reconstruction of protoforms, i.e., ancestral word forms from which modern language forms are derived. While prior work has primarily relied on probabilistic models of phonological edits to infer protoforms from cognate sets, such approaches are limited by their predominantly data-driven nature. In contrast, our model integrates data-driven inference with rule-based heuristics within an evolutionary optimization framework. This hybrid approach leverages both statistical patterns and linguistically motivated constraints to guide the reconstruction process. We evaluate our method on the task of reconstructing Latin protoforms using a dataset of cognates from five Romance languages. Experimental results demonstrate substantial improvements over established baselines across both character-level accuracy and phonological plausibility metrics. Keywords: protoform reconstruction, historical linguistics, evolutionary algorithms, phonological modeling, rule-based inference.
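The combination of a data-driven cost with rule-based penalties inside an evolutionary loop can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' implementation: the fitness function (summed edit distance to the reflexes plus rule penalties), the two placeholder rules in `RULES`, the mutation operators, and the use of orthographic forms instead of phonetic transcriptions.

```python
import random

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# Hypothetical placeholder rules: (check, penalty if the check fails).
RULES = [
    (lambda w: any(ch in "aeiou" for ch in w), 2),  # must contain a vowel
    (lambda w: len(w) >= 3, 1),                     # not implausibly short
]

def fitness(candidate, reflexes, rules=RULES):
    """Lower is better: data-driven cost plus rule-based penalties."""
    data_cost = sum(edit_distance(candidate, r) for r in reflexes)
    rule_cost = sum(penalty for check, penalty in rules if not check(candidate))
    return data_cost + rule_cost

def mutate(word, alphabet, rng):
    """Apply one random substitution, insertion, or deletion."""
    op = rng.choice(["sub", "ins", "del"])
    if op == "del" and len(word) > 1:
        i = rng.randrange(len(word))
        return word[:i] + word[i + 1:]
    if op == "ins":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(alphabet) + word[i:]
    i = rng.randrange(len(word))  # substitution (also the fallback for 1-char words)
    return word[:i] + rng.choice(alphabet) + word[i + 1:]

def evolve(reflexes, rules=RULES, generations=50, pop_size=30, seed=0):
    """Elitist search: the best candidates always survive to the next generation."""
    rng = random.Random(seed)
    alphabet = sorted(set("".join(reflexes)))
    population = list(reflexes)  # seed the search with the attested forms
    for _ in range(generations):
        children = [mutate(rng.choice(population), alphabet, rng)
                    for _ in range(pop_size)]
        pool = set(population) | set(children)
        population = sorted(pool, key=lambda w: (fitness(w, reflexes, rules), w))[:pop_size]
    return population[0]

# Orthographic reflexes of the word for "wolf" in five Romance languages.
REFLEXES = ["lobo", "lupo", "lobo", "loup", "lup"]
best = evolve(REFLEXES)
print(best, fitness(best, REFLEXES))
```

Because the search is elitist, the returned candidate never scores worse than the best attested form it started from; whether it resembles the actual Latin form depends entirely on how informative the cost and the rules are.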
- North America > Canada > Ontario > Toronto (0.14)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Europe > Sweden (0.04)
Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense
Cahyawijaya, Samuel, Zhang, Ruochen, Lovenia, Holy, Cruz, Jan Christian Blaise, Gilbert, Elisa, Nomoto, Hiroki, Aji, Alham Fikri
Multilingual large language models (LLMs) have gained prominence, but concerns arise regarding their reliability beyond English. This study addresses the gap in cross-lingual semantic evaluation by introducing a novel benchmark for cross-lingual sense disambiguation, StingrayBench. In this paper, we demonstrate the use of false friends -- words that are orthographically similar but have completely different meanings in two languages -- as a possible approach to pinpoint the limitation of cross-lingual sense disambiguation in LLMs. We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German, and challenge LLMs to distinguish their use in context. In our analysis of various models, we observe they tend to be biased toward higher-resource languages. We also propose new metrics for quantifying the cross-lingual sense bias and comprehension based on our benchmark. Our work contributes to developing more diverse and inclusive language modeling, promoting fairer access for the wider multilingual community.
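The definition of a false friend given above (orthographically similar, no shared meaning) can be made concrete with a small sketch. It only illustrates the definition and is not how StingrayBench was built: the similarity threshold, the `is_false_friend` helper, and the English glosses used as sense labels are all assumptions.

```python
from difflib import SequenceMatcher

def orthographic_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_false_friend(entry_a, entry_b, threshold=0.8):
    """True if the two words look alike but share no sense label.

    Each entry is a (word, set_of_sense_glosses) pair; the glosses are
    hypothetical annotations supplied by the caller.
    """
    word_a, senses_a = entry_a
    word_b, senses_b = entry_b
    similar = orthographic_similarity(word_a, word_b) >= threshold
    return similar and not (set(senses_a) & set(senses_b))

# English "gift" (a present) vs. German "Gift" (poison): a false friend.
print(is_false_friend(("gift", {"present"}), ("Gift", {"poison"})))
# English "hand" vs. German "Hand": same form and same meaning, a true cognate.
print(is_false_friend(("hand", {"hand"}), ("Hand", {"hand"})))
```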
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- Asia > Malaysia (0.05)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (19 more...)
Semisupervised Neural Proto-Language Reconstruction
Lu, Liang, Xie, Peirong, Mortensen, David R.
Existing work implementing comparative reconstruction of ancestral languages (proto-languages) has usually required full supervision. However, historical reconstruction models are only of practical value if they can be trained with a limited amount of labeled data. We propose a semisupervised historical reconstruction task in which the model is trained on only a small amount of labeled data (cognate sets with proto-forms) and a large amount of unlabeled data (cognate sets without proto-forms). We propose a neural architecture for comparative reconstruction (DPD-BiReconstructor) incorporating an essential insight from linguists' comparative method: that reconstructed words should not only be reconstructable from their daughter words, but also deterministically transformable back into their daughter words. We show that this architecture is able to leverage unlabeled cognate sets to outperform strong semisupervised baselines on this novel task.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > California (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- (12 more...)
Neural Proto-Language Reconstruction
Cui, Chenxuan, Chen, Ying, Wang, Qinxin, Mortensen, David R.
Proto-form reconstruction has been a painstaking process for linguists. Recently, computational models such as RNNs and Transformers have been proposed to automate this process. We take three different approaches to improve upon previous methods: data augmentation to recover missing reflexes, adding a VAE structure to the Transformer model for proto-to-language prediction, and using a neural machine translation model for the reconstruction task. We find that with the additional VAE structure, the Transformer model performs better on the WikiHan dataset, and that the data augmentation step stabilizes training.
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Ireland (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Improved Neural Protoform Reconstruction via Reflex Prediction
Lu, Liang, Wang, Jingzhi, Mortensen, David R.
Protolanguage reconstruction is central to historical linguistics. The comparative method, one of the most influential theoretical and methodological frameworks in the history of the language sciences, allows linguists to infer protoforms (reconstructed ancestral words) from their reflexes (related modern words) based on the assumption of regular sound change. Not surprisingly, numerous computational linguists have attempted to operationalize comparative reconstruction through various computational models, the most successful of which have been supervised encoder-decoder models, which treat the problem of predicting protoforms given sets of reflexes as a sequence-to-sequence problem. We argue that this framework ignores one of the most important aspects of the comparative method: not only should protoforms be inferable from cognate sets (sets of related reflexes) but the reflexes should also be inferable from the protoforms. Leveraging another line of research -- reflex prediction -- we propose a system in which candidate protoforms from a reconstruction model are reranked by a reflex prediction model. We show that this more complete implementation of the comparative method allows us to surpass state-of-the-art protoform reconstruction methods on three of four Chinese and Romance datasets.
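The reranking idea described above, in which candidate protoforms are rescored by how well the attested reflexes can be predicted from them, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' system: the additive scoring formula, the `weight` parameter, and the rule-based `toy_predict_reflex` function stand in for what would be learned neural models, and orthographic forms are used for readability.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def toy_predict_reflex(proto: str, lang: str) -> str:
    """Hypothetical stand-in for a reflex prediction model."""
    form = proto[:-2] + "o" if proto.endswith("um") else proto
    if lang == "es":
        form = form.replace("p", "b").replace("u", "o")
    return form

def rerank(candidates, reflexes, predict_reflex, weight=1.0):
    """Reorder (protoform, reconstruction_score) pairs.

    A candidate is penalized by the total edit distance between the
    reflexes predicted from it and the attested reflexes.
    """
    scored = []
    for proto, recon_score in candidates:
        penalty = sum(edit_distance(predict_reflex(proto, lang), form)
                      for lang, form in reflexes.items())
        scored.append((recon_score - weight * penalty, proto))
    scored.sort(key=lambda item: -item[0])
    return [proto for _, proto in scored]

reflexes = {"es": "lobo", "it": "lupo"}
# The reconstruction model alone slightly prefers the second candidate.
candidates = [("lupum", -1.2), ("lopom", -1.0)]
print(rerank(candidates, reflexes, toy_predict_reflex))
```

In this toy case the candidate that the reconstruction score alone ranks second moves to the top, because both attested reflexes can be derived from it exactly.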
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (30 more...)
Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space
We train a bilingual Arabic-Hebrew language model using a transliterated version of Arabic texts in Hebrew, to ensure both languages are represented in the same script. Given the morphological and structural similarities and the extensive number of cognates shared between Arabic and Hebrew, we assess the performance of a language model that employs a unified script for both languages on machine translation, which requires cross-lingual knowledge. The results are promising: our model outperforms a contrasting model which keeps the Arabic texts in the Arabic script, demonstrating the efficacy of the transliteration step. Despite being trained on a dataset approximately 60% smaller than that of other existing language models, our model appears to deliver comparable performance in machine translation across both translation directions.
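A character-level transliteration step of the kind described above can be pictured as a lookup table applied before tokenization. The sketch below is an assumption for illustration only: the table covers five letters with well-known Hebrew counterparts, ignores Hebrew final letter forms and diacritics, and is not the mapping used by the authors.

```python
# Illustrative, deliberately incomplete Arabic-to-Hebrew letter table.
ARABIC_TO_HEBREW = {
    "ك": "כ",  # kaf
    "ت": "ת",  # ta / tav
    "ب": "ב",  # ba / bet
    "ل": "ל",  # lam / lamed
    "م": "מ",  # mim / mem (final form not handled)
}

def transliterate(text: str, table=ARABIC_TO_HEBREW) -> str:
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)

# The Arabic root k-t-b ("write") lands on the same letters as its Hebrew cognate root.
print(transliterate("كتب"))
```

Once both languages share a script, cognate roots such as this one can share subword tokens, which is the effect the unified-script setup is meant to exploit.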
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
- North America > Dominican Republic (0.04)
- (8 more...)
A Computational Model for the Assessment of Mutual Intelligibility Among Closely Related Languages
Nieder, Jessica, List, Johann-Mattis
Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it. Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments. To study mutual intelligibility computationally, we propose a computer-assisted method using the Linear Discriminative Learner, a computational model developed to approximate the cognitive processes by which humans learn languages, which we expand with multilingual semantic vectors and multilingual sound classes. We test the model on cognate data from German, Dutch, and English, three closely related Germanic languages. We find that our model's comprehension accuracy depends on 1) the automatic trimming of inflections and 2) the language pair for which comprehension is tested. Our multilingual modelling approach not only offers new methodological findings for automatic testing of mutual intelligibility across languages but also extends the use of Linear Discriminative Learning to multilingual settings.
- Europe > Germany > Saxony > Leipzig (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Poland > Greater Poland Province > Poznań (0.04)
- Europe > Netherlands (0.04)
Are Sounds Sound for Phylogenetic Reconstruction?
Häuser, Luise, Jäger, Gerhard, Rama, Taraka, List, Johann-Mattis, Stamatakis, Alexandros
In traditional studies on language evolution, scholars often emphasize the importance of sound laws and sound correspondences for phylogenetic inference of language family trees. However, to date, computational approaches have typically not taken this potential into account. Most computational studies still rely on lexical cognates as the major data source for phylogenetic reconstruction in linguistics, although there do exist a few studies in which authors praise the benefits of comparing words at the level of sound sequences. Building on (a) ten diverse datasets from different language families, and (b) state-of-the-art methods for automated cognate and sound correspondence detection, we test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer to the gold standard phylogenies, by approximately one third on average with respect to the generalized quartet distance, than phylogenies reconstructed from sound correspondences.
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- Europe > Germany > Saxony > Leipzig (0.04)
- Europe > Germany > North Rhine-Westphalia > Düsseldorf Region > Düsseldorf (0.04)
- (5 more...)
Weakly-supervised Deep Cognate Detection Framework for Low-Resourced Languages Using Morphological Knowledge of Closely-Related Languages
Goswami, Koustava, Rani, Priya, Fransen, Theodorus, McCrae, John P.
Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks, including unsupervised machine translation, named entity recognition and information retrieval. Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models, which under-perform for most under-resourced languages. This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages using morphological knowledge from closely related languages. We train an encoder to gain morphological knowledge of a language and transfer the knowledge to perform unsupervised and weakly-supervised cognate detection tasks, with and without the pivot language, for the closely related languages. In the unsupervised setting, it removes the need for hand-crafted annotation of cognates. We performed experiments on different published cognate detection datasets across language families and observed that our method significantly outperforms state-of-the-art supervised and unsupervised methods. Our model can be extended to a wide range of languages from any language family, as it does not require annotated cognate pairs for training. The code and dataset building scripts can be found at https://github.com/koustavagoswami/Weakly_supervised-Cognate_Detection
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > Ireland > Connaught > County Galway > Galway (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- (13 more...)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)
Representing and Computing Uncertainty in Phonological Reconstruction
List, Johann-Mattis, Hill, Nathan W., Forkel, Robert, Blum, Frederic
Despite the inherently fuzzy nature of reconstructions in historical linguistics, most scholars do not represent their uncertainty when proposing proto-forms. With the increasing success of recently proposed approaches to automating certain aspects of the traditional comparative method, the formal representation of proto-forms has also improved. This formalization makes it possible to address both the representation and the computation of uncertainty. Building on recent advances in supervised phonological reconstruction, during which an algorithm learns how to reconstruct words in a given proto-language relying on previously annotated data, and inspired by improved methods for automated word prediction from cognate sets, we present a new framework that allows for the representation of uncertainty in linguistic reconstruction and also includes a workflow for the computation of fuzzy reconstructions from linguistic data.
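One simple way to picture a fuzzy reconstruction is as a sequence of per-position distributions over sounds rather than a single string. The sketch below assumes that several already aligned candidate reconstructions of equal length are available (for example from repeated runs of a reconstruction algorithm) and counts how often each segment occurs in each position. The function name, the input format, and the candidate forms are illustrative assumptions, not the framework's actual workflow.

```python
from collections import Counter

def fuzzy_reconstruction(candidates):
    """Summarize aligned candidate reconstructions position by position.

    `candidates` is a list of segment lists of equal length. Returns one
    dict per position mapping each segment to its relative frequency.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    length = len(candidates[0])
    if any(len(c) != length for c in candidates):
        raise ValueError("candidates must be aligned to the same length")
    result = []
    for position in zip(*candidates):
        counts = Counter(position)
        total = sum(counts.values())
        result.append({seg: n / total for seg, n in counts.most_common()})
    return result

# Four hypothetical candidate reconstructions for one cognate set.
candidates = [
    ["l", "u", "p", "u", "m"],
    ["l", "u", "p", "u", "m"],
    ["l", "o", "p", "u", "m"],
    ["l", "u", "b", "u", "m"],
]
for slot in fuzzy_reconstruction(candidates):
    print(slot)
```

Positions on which all candidates agree come out with a single segment at frequency 1.0, while contested positions keep their alternatives together with their relative support.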
- Europe > Germany > Saxony > Leipzig (0.05)
- South America > Brazil > Federal District > Brasília (0.04)
- North America > United States (0.04)
- (7 more...)