wordlist
Self-Supervised Borrowing Detection on Multilingual Wordlists
This paper presents a fully self-supervised approach to borrowing detection in multilingual wordlists. The method combines two sources of information: PMI similarities based on a global correspondence model and a lightweight contrastive component trained on phonetic feature vectors. It further includes an automatic procedure for selecting decision thresholds without requiring labeled data. Experiments on benchmark datasets show that PMI alone already improves over existing string similarity measures such as NED and SCA, and that the combined similarity performs on par with or better than supervised baselines. An ablation study highlights the importance of character encoding, temperature settings and augmentation strategies. The approach scales to datasets of different sizes, works without manual supervision and is provided with a command-line tool that allows researchers to conduct their own studies.
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- Europe > Montenegro (0.04)
- North America > United States > New Jersey (0.04)
- Asia (0.04)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.67)
A Fundamental Algorithm for Dependency Parsing (With Corrections)
Abstract-This paper presents a fundamental algorithm for parsing natural language sentences into dependency trees. Unlike phrase-structure (constituency) parsers, this algorithm operates one word at a time, attaching each word as soon as it can be attached, corresponding to properties claimed for the parser in the human brain. This paper develops, from first principles, several variations on a fundamental algorithm for parsing natural language into dependency trees. This is an exposition of an algorithm that has been known, in some form, since the 1960s but is not presented systematically in the extant literature. Unlike phrase-structure (constituency) parsers, this algorithm operates one word at a time, attaching each word as soon as it can be attached. There is good evidence that the parsing process used by the human mind has these properties [1].
- North America > United States > Georgia > Clarke County > Athens (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > New York (0.04)
- North America > United States > California > Los Angeles County > Santa Monica (0.04)
Identifiability and minimality bounds of quantum and post-quantum models of classical stochastic processes
Riechers, Paul M., Elliott, Thomas J.
To make sense of the world around us, we develop models, constructed to enable us to replicate, describe, and explain the behaviours we see. Focusing on the broad case of sequences of correlated random variables, i.e., classical stochastic processes, we tackle the question of determining whether or not two different models produce the same observable behavior. This is the problem of identifiability. Curiously, the physics of the model need not correspond to the physics of the observations; recent work has shown that it is even advantageous -- in terms of memory and thermal efficiency -- to employ quantum models to generate classical stochastic processes. We resolve the identifiability problem in this regime, providing a means to compare any two models of a classical process, be the models classical, quantum, or `post-quantum', by mapping them to a canonical `generalized' hidden Markov model. Further, this enables us to place (sometimes tight) bounds on the minimal dimension required of a quantum model to generate a given classical stochastic process.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- (2 more...)
Feature-Refined Unsupervised Model for Loanword Detection
We propose an unsupervised method for detecting loanwords i.e., words borrowed from one language into another. While prior work has primarily relied on language-external information to identify loanwords, such approaches can introduce circularity and constraints into the historical linguistics workflow. In contrast, our model relies solely on language-internal information to process both native and borrowed words in monolingual and multilingual wordlists. By extracting pertinent linguistic features, scoring them, and mapping them probabilistically, we iteratively refine initial results by identifying and generalizing from emerging patterns until convergence. This hybrid approach leverages both linguistic and statistical cues to guide the discovery process. We evaluate our method on the task of isolating loanwords in datasets from six standard Indo-European languages: English, German, French, Italian, Spanish, and Portuguese. Experimental results demonstrate that our model outperforms baseline methods, with strong performance gains observed when scaling to cross-linguistic data.
- North America > Canada > Quebec (0.04)
- North America > United States > Oregon (0.04)
- North America > United States > California (0.04)
- (2 more...)
Annotating and Inferring Compositional Structures in Numeral Systems Across Languages
Rubehn, Arne, Rzymski, Christoph, Ciucci, Luca, van Dam, Kellen Parker, Kučerová, Alžběta, Bocklage, Katja, Snee, David, Stephen, Abishek, List, Johann-Mattis
Numeral systems across the world's languages vary in fascinating ways, both regarding their synchronic structure and the diachronic processes that determined how they evolved in their current shape. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find allomorphy as the major reason for segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.
Unstable Grounds for Beautiful Trees? Testing the Robustness of Concept Translations in the Compilation of Multilingual Wordlists
Snee, David, Ciucci, Luca, Rubehn, Arne, van Dam, Kellen Parker, List, Johann-Mattis
Multilingual wordlists play a crucial role in comparative linguistics. While many studies have been carried out to test the power of computational methods for language subgrouping or divergence time estimation, few studies have put the data upon which these studies are based to a rigorous test. Here, we conduct a first experiment that tests the robustness of concept translation as an integral part of the compilation of multilingual wordlists. Investigating the variation in concept translations in independently compiled wordlists from 10 dataset pairs covering 9 different language families, we find that on average, only 83% of all translations yield the same word form, while identical forms in terms of phonetic transcriptions can only be found in 23% of all cases. Our findings can prove important when trying to assess the uncertainty of phylogenetic studies and the conclusions derived from them.
- Europe > Germany > Saxony > Leipzig (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- (9 more...)
PenTest++: Elevating Ethical Hacking with AI and Automation
Al-Sinani, Haitham S., Mitchell, Chris J.
Traditional ethical hacking relies on skilled professionals and time-intensive command management, which limits its scalability and efficiency. To address these challenges, we introduce PenTest++, an AI-augmented system that integrates automation with generative AI (GenAI) to optimise ethical hacking workflows. Developed in a controlled virtual environment, PenTest++ streamlines critical penetration testing tasks, including reconnaissance, scanning, enumeration, exploitation, and documentation, while maintaining a modular and adaptable design. The system balances automation with human oversight, ensuring informed decision-making at key stages, and offers significant benefits such as enhanced efficiency, scalability, and adaptability. However, it also raises ethical considerations, including privacy concerns and the risks of AI-generated inaccuracies (hallucinations). This research underscores the potential of AI-driven systems like PenTest++ to complement human expertise in cybersecurity by automating routine tasks, enabling professionals to focus on strategic decision-making. By incorporating robust ethical safeguards and promoting ongoing refinement, PenTest++ demonstrates how AI can be responsibly harnessed to address operational and ethical challenges in the evolving cybersecurity landscape.
- Europe > Poland > Kuyavian-Pomeranian Province > Bydgoszcz (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > United Kingdom (0.04)
- (3 more...)
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (0.56)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.35)
A Likelihood Ratio Test of Genetic Relationship among Languages
Akavarapu, V. S. D. S. Mahesh, Bhattacharya, Arnab
Lexical resemblances among a group of languages indicate that the languages could be genetically related, i.e., they could have descended from a common ancestral language. However, such resemblances can arise by chance and, hence, need not always imply an underlying genetic relationship. Many tests of significance based on permutation of wordlists and word similarity measures appeared in the past to determine the statistical significance of such relationships. We demonstrate that although existing tests may work well for bilateral comparisons, i.e., on pairs of languages, they are either infeasible by design or are prone to yield false positives when applied to groups of languages or language families. To this end, inspired by molecular phylogenetics, we propose a likelihood ratio test to determine if given languages are related based on the proportion of invariant character sites in the aligned wordlists applied during tree inference. Further, we evaluate some language families and show that the proposed test solves the problem of false positives. Finally, we demonstrate that the test supports the existence of macro language families such as Nostratic and Macro-Mayan.
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Kansas (0.04)
- (9 more...)
Detecting Lexical Borrowings from Dominant Languages in Multilingual Wordlists
Miller, John E., List, Johann-Mattis
Language contact is a pervasive phenomenon reflected in the borrowing of words from donor to recipient languages. Most computational approaches to borrowing detection treat all languages under study as equally important, even though dominant languages have a stronger impact on heritage languages than vice versa. We test new methods for lexical borrowing detection in contact situations where dominant languages play an important role, applying two classical sequence comparison methods and one machine learning method to a sample of seven Latin American languages which have all borrowed extensively from Spanish. All methods perform well, with the supervised machine learning system outperforming the classical systems. A review of detection errors shows that borrowing detection could be substantially improved by taking into account donor words with divergent meanings from recipient words.
- Europe > Germany > Saxony > Leipzig (0.05)
- South America > Peru > Lima Department > Lima Province > Lima (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (3 more...)
- Research Report > New Finding (0.47)
- Research Report > Experimental Study (0.47)
Inference of Partial Colexifications from Multilingual Wordlists
The past years have seen a drastic rise in studies devoted to the investigation of colexification patterns in individual languages families in particular and the languages of the world in specific. Specifically computational studies have profited from the fact that colexification as a scientific construct is easy to operationalize, enabling scholars to infer colexification patterns for large collections of cross-linguistic data. Studies devoted to partial colexifications -- colexification patterns that do not involve entire words, but rather various parts of words--, however, have been rarely conducted so far. This is not surprising, since partial colexifications are less easy to deal with in computational approaches and may easily suffer from all kinds of noise resulting from false positive matches. In order to address this problem, this study proposes new approaches to the handling of partial colexifications by (1) proposing new models with which partial colexification patterns can be represented, (2) developing new efficient methods and workflows which help to infer various types of partial colexification patterns from multilingual wordlists, and (3) illustrating how inferred patterns of partial colexifications can be computationally analyzed and interactively visualized.
- Europe > Germany > Saxony > Leipzig (0.05)
- Asia > China > Fujian Province > Fuzhou (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (11 more...)