morpheme
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > China > Hong Kong (0.04)
- (23 more...)
Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxóchtil Mixtec ASR
This paper investigates the impact of using morphologically-informed tokenizers to aid and streamline the interlinear gloss annotation of an audio corpus of Yoloxóchitl Mixtec (YM) using a combination of ASR and text-based sequence-to-sequence tools, with the goal of improving efficiency while reducing the workload of a human annotator. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving information about tonal morphology as much as possible. One of these approaches, a Segment and Melody tokenizer, simply extracts the tones without predicting segmentation. The other, a Sequence of Processes tokenizer, predicts segmentation for the words, which could allow an end-to-end ASR system to produce segmented and unsegmented transcriptions in a single pass. We find that these novel tokenizers are competitive with BPE and Unigram models, and the Segment-and-Melody model outperforms traditional tokenizers in terms of word error rate but does not reach the same character error rate. In addition, we analyze tokenizers on morphological and information-theoretic metrics to find predictive correlations with downstream performance. Our results suggest that nonlinear tokenizers designed specifically for the non-concatenative morphology of a language are competitive with conventional BPE and Unigram models for ASR. Further research will be necessary to determine the applicability of these tokenizers in downstream processing tasks.
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Europe > Germany > Saxony > Leipzig (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- (8 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: Isi-Xhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isi-Xhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offers a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.14)
- Asia > Singapore (0.04)
- (21 more...)
Quantifying Phonosemantic Iconicity Distributionally in 6 Languages
Flint, George, Kislay, Kaustubh
Language is, as commonly theorized, largely arbitrary. Yet, systematic relationships between phonetics and semantics have been observed in many specific cases. To what degree could those systematic relationships manifest themselves in large scale, quantitative investigations--both in previously identified and unidentified phenomena? This work undertakes a distributional approach to quantifying phonosemantic iconicity at scale across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). In each language, we analyze the alignment of morphemes' phonetic and semantic similarity spaces with a suite of statistical measures, and discover an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns. We also analyze 5 previously hypothesized phonosemantic alignments, finding support for some such alignments and mixed results for others.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Finland > Central Finland > Jyväskylä (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal
Guha, Ambalika, Saha, Sajal, Ballav, Debanjan, Mitra, Soumi, Chakraborty, Hritwick
Preserving linguistic diversity is necessary as every language offers a distinct perspective on the world. There have been numerous global initiatives to preserve endangered languages through documentation. This paper is a part of a project which aims to develop a trilingual (Toto-Bangla-English) language learning application to digitally archive and promote the endangered Toto language of West Bengal, India. This application, designed for both native Toto speakers and non-native learners, aims to revitalize the language by ensuring accessibility and usability through Unicode script integration and a structured language corpus. The research includes detailed linguistic documentation collected via fieldwork, followed by the creation of a morpheme-tagged, trilingual corpus used to train a Small Language Model (SLM) and a Transformer-based translation engine. The analysis covers inflectional morphology such as person-number-gender agreement, tense-aspect-mood distinctions, and case marking, alongside derivational strategies that reflect word-class changes. Script standardization and digital literacy tools were also developed to enhance script usage. The study offers a sustainable model for preserving endangered languages by incorporating traditional linguistic methodology with AI. This bridge between linguistic research with technological innovation highlights the value of interdisciplinary collaboration for community-based language revitalization.
- Asia > India > West Bengal > Kolkata (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
- Asia > China (0.04)
- (14 more...)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
- Education (1.00)
IASC: Interactive Agentic System for ConLangs
Taguchi, Chihiro, Sproat, Richard
We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is 'translated' from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the 'translated' sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs 'know' about language-not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. https://github.com/SakanaAI/IASC
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France (0.04)
- (15 more...)
- Research Report > New Finding (1.00)
- Instructional Material (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Morpheme Induction for Emergent Language
Boldt, Brendon, Mortensen, David
We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR's performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.
Morphological Synthesizer for Ge'ez Language: Addressing Morphological Complexity and Resource Limitations
Gebremariam, Gebrearegawi, Teklehaymanot, Hailay, Mezgebe, Gebregewergs
Ge'ez is an ancient Semitic language renowned for its unique alphabet. It serves as the script for numerous languages, including Tigrinya and Amharic, and played a pivotal role in Ethiopia's cultural and religious development during the Aksumite kingdom era. Ge'ez remains significant as a liturgical language in Ethiopia and Eritrea, with much of the national identity documentation recorded in Ge'ez. These written materials are invaluable primary sources for studying Ethiopian and Eritrean philosophy, creativity, knowledge, and civilization. Ge'ez has a complex morphological structure with rich inflectional and derivational morphology, and no usable NLP has been developed and published until now due to the scarcity of annotated linguistic data, corpora, labeled datasets, and lexicons. Therefore, we propose a rule-based Ge'ez morphological synthesizer to generate surface words from root words according to the morphological structures of the language. We used 1,102 sample verbs, representing all verb morphological structures, to test and evaluate the system. The system achieves a performance of 97.4%, outperforming the baseline model and suggesting that future work should build a comprehensive system considering morphological variations of the language. Keywords: Ge'ez, NLP, morphology, morphological synthesizer, rule-based
- Africa > Eritrea (0.25)
- Africa > Ethiopia > Addis Ababa > Addis Ababa (0.05)
- North America > United States > Indiana (0.04)
- (8 more...)
MoVoC: Morphology-Aware Subword Construction for Geez Script Languages
Teklehaymanot, Hailay Kidu, Fazlija, Dren, Nejdl, Wolfgang
Subword-based tokenization methods often fail to preserve morphological boundaries, a limitation especially pronounced in low-resource, morphologically complex languages such as those written in the Geez script. To address this, we present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary. This hybrid segmentation approach combines morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. To tackle resource scarcity, we curate and release manually annotated morpheme data for four Geez script languages and a morpheme-aware vocabulary for two of them. While the proposed tokenization method does not lead to significant gains in automatic translation quality, we observe consistent improvements in intrinsic metrics, MorphoScore, and Boundary Precision, highlighting the value of morphology-aware segmentation in enhancing linguistic fidelity and token efficiency. Our morpheme-annotated datasets and tokenizer will be publicly available to support further research in low-resource, morphologically rich languages. Our code and data are available on GitHub: https://github.com/hailaykidu/MoVoC
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Singapore (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (17 more...)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > China > Hong Kong (0.04)
- (20 more...)