AITopics | historical linguistics

Collaborating Authors

historical linguistics

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Transformer-Enabled Diachronic Analysis of Vedic Sanskrit: Neural Methods for Quantifying Types of Language Change

Hariharan, Ananth, Mortensen, David

arXiv.org Artificial IntelligenceDec-8-2025

This study demonstrates how hybrid neural-symbolic methods can yield significant new insights into the evolution of a morphologically rich, low-resource language. We challenge the naive assumption that linguistic change is simplification by quantitatively analyzing over 2,000 years of Sanskrit, demonstrating how weakly-supervised hybrid methods can yield new insights into the evolution of morphologically rich, low-resource languages. Our approach addresses data scarcity through weak supervision, using 100+ high-precision regex patterns to generate pseudo-labels for fine-tuning a multilingual BERT. We then fuse symbolic and neural outputs via a novel confidence-weighted ensemble, creating a system that is both scalable and interpretable. Applying this framework to a 1.47-million-word diachronic corpus, our ensemble achieves a 52.4% overall feature detection rate. Our findings reveal that Sanskrit's overall morphological complexity does not decrease but is instead dynamically redistributed: while earlier verbal features show cyclical patterns of decline, complexity shifts to other domains, evidenced by a dramatic expansion in compounding and the emergence of new philosophical terminology. Critically, our system produces well-calibrated uncertainty estimates, with confidence strongly correlating with accuracy (Pearson r = 0.92) and low overall calibration error (ECE = 0.043), bolstering the reliability of these findings for computational philology.

machine learning, natural language, sanskrit, (18 more...)

arXiv.org Artificial Intelligence

2512.05364

Country:

Europe (1.00)
North America > United States (0.68)
Asia (0.68)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Feature-Refined Unsupervised Model for Loanword Detection

Kpoglu, Promise Dodzi

arXiv.org Artificial IntelligenceAug-26-2025

We propose an unsupervised method for detecting loanwords i.e., words borrowed from one language into another. While prior work has primarily relied on language-external information to identify loanwords, such approaches can introduce circularity and constraints into the historical linguistics workflow. In contrast, our model relies solely on language-internal information to process both native and borrowed words in monolingual and multilingual wordlists. By extracting pertinent linguistic features, scoring them, and mapping them probabilistically, we iteratively refine initial results by identifying and generalizing from emerging patterns until convergence. This hybrid approach leverages both linguistic and statistical cues to guide the discovery process. We evaluate our method on the task of isolating loanwords in datasets from six standard Indo-European languages: English, German, French, Italian, Spanish, and Portuguese. Experimental results demonstrate that our model outperforms baseline methods, with strong performance gains observed when scaling to cross-linguistic data.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2508.17923

Country:

Europe (0.28)
North America (0.28)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

Add feedback

PILA: A Historical-Linguistic Dataset of Proto-Italic and Latin

Bothwell, Stephen, DuSell, Brian, Chiang, David, Krostenko, Brian

arXiv.org Artificial IntelligenceApr-25-2024

Computational historical linguistics seeks to systematically understand processes of sound change, including during periods at which little to no formal recording of language is attested. At the same time, few computational resources exist which deeply explore phonological and morphological connections between proto-languages and their descendants. This is particularly true for the family of Italic languages. To assist historical linguists in the study of Italic sound change, we introduce the Proto-Italic to Latin (PILA) dataset, which consists of roughly 3,000 pairs of forms from Proto-Italic and Latin. We provide a detailed description of how our dataset was created and organized. Then, we exhibit PILA's value in two ways. First, we present baseline results for PILA on a pair of traditional computational historical linguistics tasks. Second, we demonstrate PILA's capability for enhancing other historical-linguistic datasets through a dataset compatibility study.

dataset, linguistics, pila, (16 more...)

arXiv.org Artificial Intelligence

2404.16341

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
(13 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Automated Cognate Detection as a Supervised Link Prediction Task with Cognate Transformer

Akavarapu, V. S. D. S. Mahesh, Bhattacharya, Arnab

arXiv.org Artificial IntelligenceFeb-5-2024

Identification of cognates across related languages is one of the primary problems in historical linguistics. Automated cognate identification is helpful for several downstream tasks including identifying sound correspondences, proto-language reconstruction, phylogenetic classification, etc. Previous state-of-the-art methods for cognate identification are mostly based on distributions of phonemes computed across multilingual wordlists and make little use of the cognacy labels that define links among cognate clusters. In this paper, we present a transformer-based architecture inspired by computational biology for the task of automated cognate detection. Beyond a certain amount of supervision, this method performs better than the existing methods, and shows steady improvement with further increase in supervision, thereby proving the efficacy of utilizing the labeled information. We also demonstrate that accepting multiple sequence alignments as input and having an end-to-end architecture with link prediction head saves much computation time while simultaneously yielding superior performance.

cogtran2, computational linguistic, linguistics, (13 more...)

arXiv.org Artificial Intelligence

2402.02926

Country:

North America > United States > New Mexico (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > Kansas (0.04)
(10 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science > Data Mining (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Cognate Transformer for Automated Phonological Reconstruction and Cognate Reflex Prediction

Akavarapu, V. S. D. S. Mahesh, Bhattacharya, Arnab

arXiv.org Artificial IntelligenceOct-18-2023

Phonological reconstruction is one of the central problems in historical linguistics where a proto-word of an ancestral language is determined from the observed cognate words of daughter languages. Computational approaches to historical linguistics attempt to automate the task by learning models on available linguistic data. Several ideas and techniques drawn from computational biology have been successfully applied in the area of computational historical linguistics. Following these lines, we adapt MSA Transformer, a protein language model, to the problem of automated phonological reconstruction. MSA Transformer trains on multiple sequence alignments as input and is, thus, apt for application on aligned cognate words. We, hence, name our model as Cognate Transformer. We also apply the model on another associated task, namely, cognate reflex prediction, where a reflex word in a daughter language is predicted based on cognate words from other daughter languages. We show that our model outperforms the existing models on both tasks, especially when it is pre-trained on masked word prediction task.

language family, reconstruction, transformer, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2023.emnlp-main.423

2310.07487

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > New Mexico (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Jambu: A historical linguistic database for South Asian languages

Arora, Aryaman, Farris, Adam, Basu, Samopriya, Kolichala, Suresh

arXiv.org Artificial IntelligenceJun-4-2023

We introduce Jambu, a cognate database of South Asian languages which unifies dozens of previous sources in a structured and accessible format. The database includes 287k lemmata from 602 lects, grouped together in 23k sets of cognates. We outline the data wrangling necessary to compile the dataset and train neural models for reflex prediction on the Indo-Aryan subset of the data. We hope that Jambu is an invaluable resource for all historical linguists and Indologists, and look towards further improvement and expansion of the database.

artificial intelligence, linguistics, natural language, (15 more...)

arXiv.org Artificial Intelligence

2306.02514

Country:

Asia > India > Rajasthan (0.05)
North America > United States > Illinois > Cook County > Chicago (0.05)
North America > United States > Texas > Dallas County > Dallas (0.04)
(26 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

New study tests machine learning on detection of borrowed words in world languages

#artificialintelligenceDec-12-2020, 02:04:04 GMT

Lexical borrowing is very widespread and may affect even those words that play an important role in our daily life. English'mountain', for example, was borrowed from Old French, along with many other words. Researchers from the Pontificia Universidad Católica del Perú and the Max Planck Institute for the Science of Human History have investigated the ability of machine learning algorithms to identify lexical borrowings using word lists from a single language. Results published in the journal PLOS ONE show that current machine-learning methods alone are insufficient for borrowing detection, confirming that additional data and expert knowledge are needed to tackle one of historical linguistics' most pressing challenges. Lexical borrowing, or the direct transfer of words from one language to another, has interested scholars for millennia, as evidenced in Plato's Kratylos dialog, in which Socrates discusses the challenge imposed by borrowed words on etymological studies.

detection, new study test machine, world language, (1 more...)

#artificialintelligence

Country: South America > Peru (0.29)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

New study tests machine learning on detection of borrowed words in world languages

#artificialintelligenceDec-10-2020, 01:08:22 GMT

IMAGE: Lexical borrowing is very widespread and may affect even those words that play an important role in our daily life. English'mountain', for example, was borrowed from Old French, along... view more Lexical borrowing, or the direct transfer of words from one language to another, has interested scholars for millennia, as evidenced already in Plato's Kratylos dialogue, in which Socrates discusses the challenge imposed by borrowed words on etymological studies. In historical linguistics, lexical borrowings help researchers trace the evolution of modern languages and indicate cultural contact between distinct linguistic groups - whether recent or ancient. However, the techniques for identifying borrowed words have resisted formalization, demanding that researchers rely on a variety of proxy information and the comparison of multiple languages. "The automated detection of lexical borrowings is still one of the most difficult tasks we face in computational historical linguistics," says Johann-Mattis List, who led the study. In the current study, researchers from PUCP and MPI-SHH employed different machine learning techniques to train language models that mimic the way in which linguists identify borrowings when considering only the evidence provided by a single language: if sounds or the ways in which sounds combine to form words are atypical when comparing them with other words in the same language, this often hints to recent borrowings.

detection, historical linguistics, new study test machine, (6 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Quantitative methods for Phylogenetic Inference in Historical Linguistics: An experimental case study of South Central Dravidian

Rama, Taraka, Kolachina, Sudheer, B, Lakshmi Bai

arXiv.org Artificial IntelligenceJan-3-2014

In this paper we examine the usefulness of two classes of algorithms Distance Methods, Discrete Character Methods (Felsenstein and Felsenstein 2003) widely used in genetics, for predicting the family relationships among a set of related languages and therefore, diachronic language change. Applying these algorithms to the data on the numbers of shared cognates- with-change and changed as well as unchanged cognates for a group of six languages belonging to a Dravidian language sub-family given in Krishnamurti et al. (1983), we observed that the resultant phylogenetic trees are largely in agreement with the linguistic family tree constructed using the comparative method of reconstruction with only a few minor differences. Furthermore, we studied these minor differences and found that they were cases of genuine ambiguity even for a well-trained historical linguist. We evaluated the trees obtained through our experiments using a well-defined criterion and report the results here. We finally conclude that quantitative methods like the ones we examined are quite useful in predicting family relationships among languages. In addition, we conclude that a modest degree of confidence attached to the intuition that there could indeed exist a parallelism between the processes of linguistic and genetic change is not totally misplaced.

artificial intelligence, krishnamurti, machine learning, (16 more...)

arXiv.org Artificial Intelligence

1401.0708

Country:

North America > United States (0.28)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

Linguistically Grounded Models of Language Change

Poibeau, Thierry

arXiv.org Artificial IntelligenceDec-1-2009

Questions related to the evolution of language have recently known an impressive increase of interest (Briscoe, 2002). This short paper aims at questioning the scientific status of these models and their relations to attested data. We show that one cannot directly model non-linguistic factors (exogenous factors) even if they play a crucial role in language evolution. We then examine the relation between linguistic models and attested language data, as well as their contribution to cognitive linguistics.

artificial intelligence, evolution, natural language, (16 more...)

arXiv.org Artificial Intelligence

cs/0607053

Country: Europe (0.16)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (0.36)
Information Technology > Artificial Intelligence > Natural Language (0.31)

Add feedback