AITopics | morfessor

Collaborating Authors

morfessor

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Lexically Grounded Subword Segmentation

Libovický, Jindřich, Helcl, Jindřich

arXiv.org Artificial IntelligenceJun-19-2024

We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present an algebraic method for obtaining subword embeddings grounded in a word embedding space. Based on that, we design a novel subword segmentation algorithm that uses the embeddings, ensuring that the procedure considers lexical meaning. Third, we introduce an efficient segmentation algorithm based on a subword bigram model that can be initialized with the lexically aware segmentation method to avoid using Morfessor and large embedding tables at inference time. We evaluate the proposed approaches using two intrinsic metrics and measure their performance on two downstream tasks: part-of-speech tagging and machine translation. Our experiments show significant improvements in the morphological plausibility of the segmentation when evaluated using segmentation precision on morpheme boundaries and improved R\'enyi efficiency in 8 languages. Although the proposed tokenization methods do not have a large impact on automatic translation quality, we observe consistent performance gains in the arguably more morphological task of part-of-speech tagging.

orig, segmentation, subword, (13 more...)

arXiv.org Artificial Intelligence

2406.1356

Country:

North America > Canada > Ontario > Toronto (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
(16 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.86)

Add feedback

More than Just Statistical Recurrence: Human and Machine Unsupervised Learning of M\=aori Word Segmentation across Morphological Processes

Varatharaj, Ashvini, Todd, Simon

arXiv.org Artificial IntelligenceMar-21-2024

Non-M\=aori-speaking New Zealanders (NMS)are able to segment M\=aori words in a highlysimilar way to fluent speakers (Panther et al.,2024). This ability is assumed to derive through the identification and extraction of statistically recurrent forms. We examine this assumption by asking how NMS segmentations compare to those produced by Morfessor, an unsupervised machine learning model that operates based on statistical recurrence, across words formed by a variety of morphological processes. Both NMS and Morfessor succeed in segmenting words formed by concatenative processes (compounding and affixation without allomorphy), but NMS also succeed for words that invoke templates (reduplication and allomorphy) and other cues to morphological structure, implying that their learning process is sensitive to more than just statistical recurrence.

morfessor, segmentation, template, (16 more...)

arXiv.org Artificial Intelligence

2403.14444

Country:

North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
Oceania > New Zealand > North Island > Waikato (0.04)
Oceania > New Zealand > North Island > Bay of Plenty Region > Whakatane (0.04)
(7 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.41)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.40)

Add feedback

MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

Limisiewicz, Tomasz, Blevins, Terra, Gonen, Hila, Ahia, Orevaoghene, Zettlemoyer, Luke

arXiv.org Artificial IntelligenceMar-15-2024

A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address the disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than characters, which are used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.

computational linguistic, morpheme, representation, (15 more...)

arXiv.org Artificial Intelligence

2403.10691

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
North America > Canada > Ontario > Toronto (0.04)
(17 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.85)

Add feedback

Effects of sub-word segmentation on performance of transformer language models

Hou, Jue, Katinskaia, Anisia, Vu, Anh-Duc, Yangarber, Roman

arXiv.org Artificial IntelligenceOct-26-2023

Language modeling is a fundamental task in natural language processing, which has been thoroughly explored with various architectures and hyperparameters. However, few studies focus on the effect of sub-word segmentation on the performance of language models (LMs). In this paper, we compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation -- Morfessor and StateMorph. We train the models for several languages -- including ones with very rich morphology -- and compare their performance with different segmentation algorithms, vocabulary sizes, and model sizes. The results show that training with morphological segmentation allows the LMs to: 1. achieve lower perplexity, 2. converge more efficiently in terms of training time, and 3. achieve equivalent or better evaluation scores on downstream tasks. Lastly, we show 4. that LMs of smaller size using morphological segmentation can perform comparably to models of larger size trained with BPE -- both in terms of (1) perplexity and (3) scores on downstream tasks. Points (2) and (4) impact on sustainability of LMs, since they reduce the model cost: size and computation time. While (2) reduces cost only in the training phase, (4) does so also in the inference phase.

bpe, language model, segmentation, (16 more...)

arXiv.org Artificial Intelligence

2305.0548

Country:

Europe > Finland > Uusimaa > Helsinki (0.04)
North America > United States > Washington > King County > Seattle (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
(10 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
(2 more...)

Add feedback

Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text

Gaser, Marwa, Mager, Manuel, Hamed, Injy, Habash, Nizar, Abdennadher, Slim, Vu, Ngoc Thang

arXiv.org Artificial IntelligenceApr-30-2023

Data sparsity is one of the main challenges posed by code-switching (CS), which is further exacerbated in the case of morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effectiveness of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform the best in segmentation tasks but under-perform in MT. Nevertheless, we find that the choice of the segmentation setup to use for MT is highly dependent on the data size. For extreme low-resource scenarios, a combination of frequency and morphology-based segmentations is shown to perform the best. For more resourced settings, such a combination does not bring significant improvements over the use of frequency-based segmentation.

machine learning, natural language, segmentation, (16 more...)

arXiv.org Artificial Intelligence

2210.0699

Country:

Europe (1.00)
Asia (1.00)
Africa > Middle East > Egypt (0.28)
North America > United States > Minnesota (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.46)

Industry: Energy > Oil & Gas (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback