historical linguistics
Transformer-Enabled Diachronic Analysis of Vedic Sanskrit: Neural Methods for Quantifying Types of Language Change
Hariharan, Ananth, Mortensen, David
This study demonstrates how weakly-supervised hybrid neural-symbolic methods can yield new insights into the evolution of morphologically rich, low-resource languages. We challenge the naive assumption that linguistic change is simplification by quantitatively analyzing over 2,000 years of Sanskrit. Our approach addresses data scarcity through weak supervision, using 100+ high-precision regex patterns to generate pseudo-labels for fine-tuning a multilingual BERT model. We then fuse symbolic and neural outputs via a novel confidence-weighted ensemble, creating a system that is both scalable and interpretable. Applying this framework to a 1.47-million-word diachronic corpus, our ensemble achieves a 52.4% overall feature detection rate. Our findings reveal that Sanskrit's overall morphological complexity does not decrease but is instead dynamically redistributed: while earlier verbal features show cyclical patterns of decline, complexity shifts to other domains, evidenced by a dramatic expansion in compounding and the emergence of new philosophical terminology. Critically, our system produces well-calibrated uncertainty estimates, with confidence strongly correlating with accuracy (Pearson r = 0.92) and low overall calibration error (ECE = 0.043), bolstering the reliability of these findings for computational philology.
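For readers unfamiliar with the calibration figure quoted above, expected calibration error (ECE) is commonly computed by grouping predictions into confidence bins and averaging the gap between mean confidence and accuracy in each bin, weighted by bin size. The following is a minimal sketch of that standard binned definition; the function name, the equal-width binning, and the default of 10 bins are assumptions for illustration, and the abstract does not state which variant the authors used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1], one per prediction.
    correct: booleans, True where the prediction was right.
    n_bins: number of equal-width bins (an illustrative default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A value near zero means that, within each bin, stated confidence roughly matches observed accuracy.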
Feature-Refined Unsupervised Model for Loanword Detection
We propose an unsupervised method for detecting loanwords, i.e., words borrowed from one language into another. While prior work has primarily relied on language-external information to identify loanwords, such approaches can introduce circularity and constraints into the historical linguistics workflow. In contrast, our model relies solely on language-internal information to process both native and borrowed words in monolingual and multilingual wordlists. By extracting pertinent linguistic features, scoring them, and mapping them probabilistically, we iteratively refine initial results by identifying and generalizing from emerging patterns until convergence. This hybrid approach leverages both linguistic and statistical cues to guide the discovery process. We evaluate our method on the task of isolating loanwords in datasets from six standard Indo-European languages: English, German, French, Italian, Spanish, and Portuguese. Experimental results demonstrate that our model outperforms baseline methods, with strong performance gains observed when scaling to cross-linguistic data.
PILA: A Historical-Linguistic Dataset of Proto-Italic and Latin
Bothwell, Stephen, DuSell, Brian, Chiang, David, Krostenko, Brian
Computational historical linguistics seeks to systematically understand processes of sound change, including during periods at which little to no formal recording of language is attested. At the same time, few computational resources exist which deeply explore phonological and morphological connections between proto-languages and their descendants. This is particularly true for the family of Italic languages. To assist historical linguists in the study of Italic sound change, we introduce the Proto-Italic to Latin (PILA) dataset, which consists of roughly 3,000 pairs of forms from Proto-Italic and Latin. We provide a detailed description of how our dataset was created and organized. Then, we exhibit PILA's value in two ways. First, we present baseline results for PILA on a pair of traditional computational historical linguistics tasks. Second, we demonstrate PILA's capability for enhancing other historical-linguistic datasets through a dataset compatibility study.
Automated Cognate Detection as a Supervised Link Prediction Task with Cognate Transformer
Akavarapu, V. S. D. S. Mahesh, Bhattacharya, Arnab
Identification of cognates across related languages is one of the primary problems in historical linguistics. Automated cognate identification is helpful for several downstream tasks, including identifying sound correspondences, proto-language reconstruction, and phylogenetic classification. Previous state-of-the-art methods for cognate identification are mostly based on distributions of phonemes computed across multilingual wordlists and make little use of the cognacy labels that define links among cognate clusters. In this paper, we present a transformer-based architecture inspired by computational biology for the task of automated cognate detection. Beyond a certain amount of supervision, this method performs better than the existing methods and shows steady improvement with further increases in supervision, thereby demonstrating the efficacy of utilizing the labeled information. We also demonstrate that accepting multiple sequence alignments as input and having an end-to-end architecture with a link prediction head saves much computation time while simultaneously yielding superior performance.
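To make the link-prediction framing concrete: if each word is a node and each predicted cognate relation is an edge, cognate clusters can be read off as connected components. The sketch below shows only that final grouping step with a small union-find; it is not the paper's architecture, and the function name and the (language, form) representation are assumptions for illustration. The example links are hand-written stand-ins for what a link prediction head might output.

```python
def clusters_from_links(words, links):
    """Group words into cognate clusters given pairwise cognate links.

    words: iterable of hashable word identifiers, e.g. (language, form).
    links: iterable of (word_a, word_b) pairs judged to be cognate.
    Returns a list of sets, one per connected component.
    """
    parent = {word: word for word in words}

    def find(word):
        # Walk to the root, shortening the path along the way.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    grouped = {}
    for word in parent:
        grouped.setdefault(find(word), set()).add(word)
    # Sort for a deterministic result.
    return sorted(grouped.values(), key=lambda cluster: sorted(cluster))


# Illustrative input: words for "hand" in four languages.
words = [("en", "hand"), ("de", "Hand"), ("fr", "main"), ("es", "mano")]
links = [(("en", "hand"), ("de", "Hand")), (("fr", "main"), ("es", "mano"))]
clusters = clusters_from_links(words, links)
```

Here the Germanic and Romance forms end up in two separate clusters because no link connects them.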
Cognate Transformer for Automated Phonological Reconstruction and Cognate Reflex Prediction
Akavarapu, V. S. D. S. Mahesh, Bhattacharya, Arnab
Phonological reconstruction is one of the central problems in historical linguistics, in which a proto-word of an ancestral language is determined from the observed cognate words of daughter languages. Computational approaches to historical linguistics attempt to automate the task by learning models on available linguistic data. Several ideas and techniques drawn from computational biology have been successfully applied in the area of computational historical linguistics. Following these lines, we adapt MSA Transformer, a protein language model, to the problem of automated phonological reconstruction. MSA Transformer trains on multiple sequence alignments as input and is thus well suited to aligned cognate words. We therefore name our model Cognate Transformer. We also apply the model to another associated task, namely cognate reflex prediction, where a reflex word in a daughter language is predicted based on cognate words from other daughter languages. We show that our model outperforms the existing models on both tasks, especially when it is pre-trained on a masked word prediction task.
Jambu: A historical linguistic database for South Asian languages
Arora, Aryaman, Farris, Adam, Basu, Samopriya, Kolichala, Suresh
We introduce Jambu, a cognate database of South Asian languages which unifies dozens of previous sources in a structured and accessible format. The database includes 287k lemmata from 602 lects, grouped together in 23k sets of cognates. We outline the data wrangling necessary to compile the dataset and train neural models for reflex prediction on the Indo-Aryan subset of the data. We hope that Jambu is an invaluable resource for all historical linguists and Indologists, and look towards further improvement and expansion of the database.
New study tests machine learning on detection of borrowed words in world languages
Lexical borrowing is very widespread and may affect even those words that play an important role in our daily life. English 'mountain', for example, was borrowed from Old French, along with many other words. Researchers from the Pontificia Universidad Católica del Perú and the Max Planck Institute for the Science of Human History have investigated the ability of machine learning algorithms to identify lexical borrowings using word lists from a single language. Results published in the journal PLOS ONE show that current machine-learning methods alone are insufficient for borrowing detection, confirming that additional data and expert knowledge are needed to tackle one of historical linguistics' most pressing challenges. Lexical borrowing, or the direct transfer of words from one language to another, has interested scholars for millennia, as evidenced in Plato's Kratylos dialogue, in which Socrates discusses the challenge imposed by borrowed words on etymological studies.
In historical linguistics, lexical borrowings help researchers trace the evolution of modern languages and indicate cultural contact between distinct linguistic groups, whether recent or ancient. However, the techniques for identifying borrowed words have resisted formalization, demanding that researchers rely on a variety of proxy information and the comparison of multiple languages. "The automated detection of lexical borrowings is still one of the most difficult tasks we face in computational historical linguistics," says Johann-Mattis List, who led the study. In the current study, researchers from PUCP and MPI-SHH employed different machine learning techniques to train language models that mimic the way in which linguists identify borrowings when considering only the evidence provided by a single language: if sounds, or the ways in which sounds combine to form words, are atypical compared with other words in the same language, this often hints at recent borrowing.
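The single-language intuition can be illustrated with a very small sound-sequence model: learn which symbol pairs are common in a wordlist, then flag words whose pairs are unusually improbable. The sketch below uses a bigram model with add-one smoothing and scores each word by its average surprisal in bits. It is a simplified stand-in, not one of the models evaluated in the study; the function names and the toy wordlist are invented for illustration.

```python
import math
from collections import Counter

START, END = "^", "$"  # boundary markers, assumed not to occur in the data


def train_bigram_model(words):
    """Count adjacent symbol pairs, including word boundaries."""
    pair_counts = Counter()
    context_counts = Counter()
    symbols = {END}
    for word in words:
        padded = START + word + END
        symbols.update(word)
        for left, right in zip(padded, padded[1:]):
            pair_counts[(left, right)] += 1
            context_counts[left] += 1
    # One extra slot is reserved for symbols never seen in training.
    return pair_counts, context_counts, len(symbols) + 1


def average_surprisal(word, model):
    """Mean negative log2 probability per transition; higher is more atypical."""
    pair_counts, context_counts, vocab_size = model
    padded = START + word + END
    total = 0.0
    for left, right in zip(padded, padded[1:]):
        # Add-one smoothing keeps unseen pairs from getting zero probability.
        prob = (pair_counts[(left, right)] + 1) / (context_counts[left] + vocab_size)
        total -= math.log2(prob)
    return total / (len(padded) - 1)


# Invented toy wordlist with a very regular sound pattern.
model = train_bigram_model(["tana", "nata", "tata", "nana", "ana", "tan"])
typical = average_surprisal("tana", model)
atypical = average_surprisal("zbrk", model)
```

In practice such a score would only be one cue among several, which is consistent with the study's conclusion that monolingual evidence alone is not sufficient.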
Quantitative methods for Phylogenetic Inference in Historical Linguistics: An experimental case study of South Central Dravidian
Rama, Taraka, Kolachina, Sudheer, B, Lakshmi Bai
In this paper we examine the usefulness of two classes of algorithms widely used in genetics, distance methods and discrete character methods (Felsenstein and Felsenstein 2003), for predicting the family relationships among a set of related languages and, therefore, diachronic language change. Applying these algorithms to the data on the numbers of shared cognates-with-change and changed as well as unchanged cognates for a group of six languages belonging to a Dravidian language sub-family given in Krishnamurti et al. (1983), we observed that the resultant phylogenetic trees are largely in agreement with the linguistic family tree constructed using the comparative method of reconstruction, with only a few minor differences. Furthermore, we studied these minor differences and found that they were cases of genuine ambiguity even for a well-trained historical linguist. We evaluated the trees obtained through our experiments using a well-defined criterion and report the results here. We conclude that quantitative methods like the ones we examined are quite useful in predicting family relationships among languages. In addition, we conclude that a modest degree of confidence in the intuition that there could indeed exist a parallelism between the processes of linguistic and genetic change is not totally misplaced.
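As an illustration of what a distance method does with cognate counts, the sketch below converts shared-cognate counts into pairwise distances and builds a tree by average-linkage clustering (UPGMA). UPGMA is chosen here only because it is one of the simplest distance methods; the abstract does not say which specific algorithms were run. The function names, the language labels L1 to L4, and the counts are invented and are not the data from Krishnamurti et al. (1983).

```python
from itertools import combinations


def cognate_distances(shared_counts, total_items):
    """Turn shared-cognate counts into distances: 1 - shared / total."""
    return {pair: 1 - count / total_items for pair, count in shared_counts.items()}


def upgma(labels, distance):
    """Build a nested-tuple tree by repeatedly merging the closest clusters.

    labels: list of language names.
    distance: dict mapping frozenset({a, b}) to the distance between a and b.
    """
    sizes = {label: 1 for label in labels}  # cluster -> number of leaves
    dist = dict(distance)
    while len(sizes) > 1:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        for other in sizes:
            if other in (a, b):
                continue
            # Size-weighted average of the distances to the two merged clusters.
            dist[frozenset((merged, other))] = (
                sizes[a] * dist[frozenset((a, other))]
                + sizes[b] * dist[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
    return next(iter(sizes))


# Invented counts of shared cognates out of 100 meanings.
shared = {
    frozenset(("L1", "L2")): 90,
    frozenset(("L3", "L4")): 80,
    frozenset(("L1", "L3")): 40,
    frozenset(("L1", "L4")): 40,
    frozenset(("L2", "L3")): 40,
    frozenset(("L2", "L4")): 40,
}
tree = upgma(["L1", "L2", "L3", "L4"], cognate_distances(shared, 100))
```

The resulting nested tuples group L1 with L2 and L3 with L4 before joining the two pairs, mirroring how a family tree would be read off the distances.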
Linguistically Grounded Models of Language Change
Questions related to the evolution of language have recently attracted a marked increase in interest (Briscoe, 2002). This short paper questions the scientific status of models of language evolution and their relation to attested data. We show that one cannot directly model non-linguistic (exogenous) factors, even though they play a crucial role in language evolution. We then examine the relation between linguistic models and attested language data, as well as their contribution to cognitive linguistics.