Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings

arXiv.org Machine Learning

Traditionally, many text-mining tasks treat individual word tokens as the finest meaningful semantic granularity. However, in many languages and specialized corpora, words are formed by concatenating semantically meaningful subword structures. For word embedding techniques, this leads not only to poor embeddings for infrequent words in long-tailed text corpora but also to weak handling of out-of-vocabulary words. In this paper we propose MorphMine for unsupervised morpheme segmentation. MorphMine applies a parsimony criterion to hierarchically segment words into the fewest morphemes at each level of the hierarchy, which yields longer shared morphemes at each level of segmentation. Experiments show that MorphMine segments words in a variety of languages into human-verified morphemes. Additionally, we experimentally demonstrate that using MorphMine morphemes to enrich word embeddings consistently improves embedding quality on a variety of embedding evaluations and a downstream language modeling task.

INTRODUCTION

Decomposing individual words into finer-granularity morphemes is a necessary preprocessing step for concatenative vocabularies, where the number of unique word forms is very large. While linguistic approaches can be used to tackle such segmentation, these rule-based approaches are often tailored to specific languages or domains. As such, data-driven, unsupervised methods that forgo linguistic knowledge have been studied [7], [18]. Typically, these methods segment words by applying a probabilistic model or a compression algorithm to a full text corpus.
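The parsimony criterion itself is easy to illustrate. Below is a minimal sketch, assuming a candidate morpheme vocabulary is already available: dynamic programming splits a word into the fewest vocabulary morphemes. This shows only the core parsimony idea; MorphMine's candidate generation and hierarchical procedure are more involved and not reproduced here.

```python
# Minimal sketch of the parsimony idea: segment a word into the fewest
# pieces drawn from a candidate morpheme vocabulary, falling back to
# single characters when no morpheme matches. Illustrative only; this
# is not MorphMine's actual algorithm.

def fewest_morphemes(word, morphemes):
    """Return a minimum-length segmentation of `word` over `morphemes`."""
    n = len(word)
    best = [None] * (n + 1)   # best[i] = fewest-piece segmentation of word[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if (piece in morphemes or len(piece) == 1) and best[j] is not None:
                cand = best[j] + [piece]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[n]

vocab = {"in", "frequent", "ly", "embed", "ding", "s"}
print(fewest_morphemes("infrequently", vocab))  # ['in', 'frequent', 'ly']
print(fewest_morphemes("embeddings", vocab))    # ['embed', 'ding', 's']
```

Preferring the fewest pieces keeps shared morphemes long ("frequent" rather than "freq" + "uent"), which is the property the abstract highlights.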


Scientific Vocabulary - Google Search

#artificialintelligence

Science learning involves lots of new vocabulary words. When helping your child learn new science words, focus on words that allow you to teach more than just that one word. This can be done by considering a word's morphemes. A morpheme is a meaningful part or unit of a word that can't be divided into smaller parts.


Unsupervised Morphological Segmentation for Detecting Parkinson’s Disease

AAAI Conferences

The growth of life expectancy entails a rise in the prevalence of aging-related neurodegenerative disorders, such as Parkinson's disease. In the ongoing quest to find sensitive behavioral markers of this condition, computerized tools prove particularly promising. Here, we propose a novel method that uses unsupervised morphological segmentation to access morphological properties of a speaker's language. In our experiments on German, the method classifies patients vs. healthy controls with 81 percent accuracy and estimates the neurological state of PD patients with a Pearson correlation of 0.46 against the Unified Parkinson's Disease Rating Scale. Ours is the first study to show that unsupervised morphological segmentation can be used for automatic detection of a neurological disorder.
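The abstract does not spell out the features or classifier, so the following is only a hedged sketch of the pipeline shape: morphological-complexity features (mean morphemes per word and a morpheme type/token ratio, both assumed here rather than taken from the paper) are computed from already-segmented transcripts and fed to an off-the-shelf logistic regression.

```python
# Hedged sketch of one plausible pipeline, NOT the paper's exact method.
# Assumes each transcript has already been morphologically segmented
# (e.g., by an unsupervised segmenter): a list of words, each word a
# list of morphemes.
from statistics import mean
from sklearn.linear_model import LogisticRegression

def morph_features(segmented_transcript):
    """Two simple morphological-complexity features for one speaker."""
    per_word = [len(morphs) for morphs in segmented_transcript]
    types = {m for morphs in segmented_transcript for m in morphs}
    return [mean(per_word),                 # mean morphemes per word
            len(types) / sum(per_word)]     # morpheme type/token ratio

# Toy German-like transcripts; labels: 0 = healthy control, 1 = PD patient.
transcripts = [
    [["geh", "en"], ["wir"], ["ein", "kauf", "en"]],
    [["lauf", "en"], ["schnell"], ["heut", "e"]],
    [["geh"], ["haus"], ["jetzt"]],
    [["ess"], ["brot"], ["gut"]],
]
labels = [0, 0, 1, 1]

X = [morph_features(t) for t in transcripts]
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```

The regression target for the rating-scale estimate would replace the binary labels with UPDRS scores; the feature side of the pipeline stays the same.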


Inside Out: Two Jointly Predictive Models for Word Representations and Phrase Representations

AAAI Conferences

The distributional hypothesis lies at the root of most existing word representation models, which infer a word's meaning from its external contexts. However, distributional models handle rare and morphologically complex words poorly and fail to identify some fine-grained linguistic regularities because they ignore word forms. Morphology, in contrast, holds that words are built from basic units, i.e., morphemes: the meaning and function of rare words can be inferred from words sharing the same morphemes, and many syntactic relations can be identified directly from word forms. The limitation of morphology, however, is that it cannot infer the relationship between two words that share no morphemes. Weighing the advantages and limitations of both approaches, we propose two novel models, BEING and SEING, that build better word representations by modeling both external contexts and internal morphemes in a jointly predictive way. Both models can also be extended to learn phrase representations following distributed morphology theory. We evaluate the proposed models on similarity tasks and analogy tasks; the results demonstrate that they significantly outperform state-of-the-art models on both word and phrase representation learning.
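The BEING and SEING objectives are not detailed in this snippet; the sketch below shows only the general idea of jointly modeling internal morphemes and external contexts. A word's representation combines an atomic word vector with its morpheme vectors, and the composed vector is scored against context vectors, skip-gram style. All names, the toy segmentation, and the additive composition rule are illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch: compose word meaning from an atomic word vector plus
# morpheme vectors, then score against a context vector as in skip-gram.
# Training would raise this score for observed (word, context) pairs and
# lower it for negative samples, updating all three vector tables.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
word_vecs  = {w: rng.normal(size=dim) for w in ["unhappiness", "joy"]}
morph_vecs = {m: rng.normal(size=dim) for m in ["un", "happi", "ness"]}
segment = {"unhappiness": ["un", "happi", "ness"], "joy": []}

def represent(word):
    """Word representation = atomic vector + mean of morpheme vectors."""
    v = word_vecs[word].copy()
    if segment[word]:
        v += np.mean([morph_vecs[m] for m in segment[word]], axis=0)
    return v

context_vecs = {c: rng.normal(size=dim) for c in ["sad", "smile"]}
score = represent("unhappiness") @ context_vecs["sad"]
print(float(score))
```

Because morpheme vectors are shared across words, a rare form like "unhappiness" borrows signal from every word containing "un" or "ness", which is exactly the advantage the abstract claims for modeling word internals.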


A Glossary of Linguistic Terms

AITopics Original Links

Warning: this web page was originally constructed to help computer science students taking my module on natural language processing. Some terms may be used differently by different authors; unless otherwise stated, definitions are based on the English language. If you find any errors, please e-mail me at p.coxhead@cs.bham.ac.uk. The verb in an active sentence is said to be in the active voice. Adjectives qualify nouns: examples are colourless and green, which qualify ideas in Colourless green ideas sleep furiously. Adjectives can also appear after verbs like be. Adverbs qualify verbs: examples are furiously, which qualifies sleep in Colourless green ideas sleep furiously, or intensely, which qualifies stared in He stared at me intensely. Adverbs can also qualify adjectives. Many English adverbs are formed from an adjective plus the ending -ly. Words like very, which can qualify adjectives or adverbs but not verbs, are sometimes called adverbs, but are perhaps best put in a separate category. In its broadest sense, an affix can be a prefix, a suffix, or an infix; more narrowly, infixes are sometimes treated separately. In an affricate, the stop and fricative must be produced in very similar positions in the mouth.