base form
Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic
Reif, Yuval, Kaplan, Guy, Schwartz, Roy
Large language models (LLMs) were shown to encode word form variations, such as "walk"->"walked", as linear directions in embedding space. However, standard tokenization algorithms treat these variations as distinct tokens -- filling the size-capped vocabulary with surface form variants (e.g., "walk", "walking", "Walk"), at the expense of less frequent words and multilingual coverage. We show that many of these variations can be captured by transformation vectors -- additive offsets that yield the appropriate word's representation when applied to the base form word embedding -- in both the input and output spaces. Building on this, we propose a compact reshaping of the vocabulary: rather than assigning unique tokens to each surface form, we compose them from shared base form and transformation vectors (e.g., "walked" = "walk" + past tense). We apply our approach to multiple LLMs and across five languages, removing up to 10% of vocabulary entries -- thereby freeing space to allocate new, more diverse tokens. Importantly, we do so while also expanding vocabulary coverage to out-of-vocabulary words, with minimal impact on downstream performance, and without modifying model weights. Our findings motivate a foundational rethinking of vocabulary design, moving from string enumeration to a compositional vocabulary that leverages the underlying structure of language.
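The composition idea in the abstract ("walked" = "walk" + past tense) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the embeddings are random toy vectors, not taken from any LLM, and estimating the offset by averaging difference vectors is one simple choice, not necessarily the authors' procedure. The helper names `estimate_offset` and `cosine` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy base-form embeddings (assumption: a real model would supply these).
base = {w: rng.normal(size=dim) for w in ["walk", "jump", "talk"]}

# Simulate inflected forms as base + one shared direction + small noise.
shared_direction = rng.normal(size=dim)
past = {
    w + "ed": v + shared_direction + 0.01 * rng.normal(size=dim)
    for w, v in base.items()
}


def estimate_offset(pairs, emb_base, emb_inflected):
    """Average the difference vectors over (base, inflected) word pairs."""
    diffs = [emb_inflected[infl] - emb_base[b] for b, infl in pairs]
    return np.mean(diffs, axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Estimate the "past tense" offset from two pairs, then apply it to a
# held-out base form instead of storing "talked" as its own entry.
past_tense = estimate_offset(
    [("walk", "walked"), ("jump", "jumped")], base, past
)
composed_talked = base["talk"] + past_tense
similarity = cosine(composed_talked, past["talked"])
```

In this toy setup the composed vector lands very close to the stored "talked" vector, which is the property that would allow the separate vocabulary entry to be dropped.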
XSTEM: An exemplar-based stemming algorithm
Stemming is the process of reducing related words to a standard form by removing affixes from them. For example, eating, eats, and eaten can be reduced to the standard form eat by removing the -ing, -s, and -en from each. Stemming is a foundational step in many text processing pipelines, including information retrieval and language modeling [2], and serves as a feature reduction step for classification or lexical transfer learning tasks [1]. Most stemming algorithms are either rule-based or corpus-based. Rule-based stemmers use a set of manually created rules to transform words to their base form, typically in conjunction with a dictionary for handling exceptions or otherwise adjusting their output. Corpus-based stemmers typically employ statistical machine learning methods to equate word forms based on distributional regularities derived from large amounts of text. Although researchers have demonstrated impressive results with corpus-based approaches (e.g., [2, and references therein]), rule-based stemming implementations ... I was introduced to exemplar theory by Keith Johnson in a phonetics seminar at Ohio State. His model is called XMOD, which, as he notes, is the best name [7]. The name XSTEM comes from that.
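The eating/eats/eaten example above can be sketched as a tiny rule-based suffix stripper. This is not the XSTEM algorithm (which is exemplar-based); the suffix list, the `min_stem` threshold, and the function name `strip_suffix` are assumptions chosen only to reproduce that one example.

```python
# Suffixes are tried in order; "ing" and "en" come before the shorter "s".
SUFFIXES = ("ing", "en", "s")


def strip_suffix(word, suffixes=SUFFIXES, min_stem=3):
    """Remove the first matching suffix if the remaining stem is long enough."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


stems = [strip_suffix(w) for w in ["eating", "eats", "eaten", "eat"]]
```

A rule set this small over-stems and under-stems many real words, which is why rule-based stemmers are usually paired with an exception dictionary as the passage describes.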
Advancing Full-Text Search Lemmatization Techniques with Paradigm Retrieval from OpenCorpora
In full-text search applications, the primary goal is to effectively retrieve and match relevant documents based on user queries. By focusing on finding the first form, or the lemma, of a word, the search process can be streamlined and optimized. The lemma serves as a normalized representation of a word's different inflected forms, allowing for a more accurate comparison between user queries and document content. This approach reduces the complexity and computational overhead associated with full morphological analysis, which includes extracting all possible forms of a word along with their grammatical properties. By prioritizing lemma retrieval, full-text search engines can achieve faster response times and more precise results, while minimizing the resources required for processing large volumes of text data. Consequently, building upon the foundation of pymorphy[1], the golemma library was developed to address the challenge of efficiently identifying the first form, or lemma, of words in the Russian language.
Text Classification with Machine Learning vs Deep Learning
The preprocessing part of the pipeline is a very important step, as it can greatly impact the model's performance. Depending on which model will be used, the original text may need to be modified so that it has the most appropriate format to feed the model. When using Bag of Words, we want all similar words to be merged into a single base form. To do this, we will extract the lemma of every token in the text and remove all stop words and every symbol that won't contribute to the model, which translates into lemmatization and cleaning of the text. On the other hand, if the context of the text is what we aim to focus on, then different words should not be merged into a single base form.
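The Bag of Words preprocessing described above (lemmatize, drop stop words, drop symbols) might look roughly like the following sketch. The stop-word set and the lemma table are tiny hand-written stand-ins, assumed here purely for illustration; a real pipeline would use a proper lemmatizer and stop-word list.

```python
import re
from collections import Counter

# Hypothetical, hand-written resources for illustration only.
STOP_WORDS = {"the", "a", "is", "are", "and", "on"}
LEMMAS = {"cats": "cat", "sat": "sit", "sitting": "sit", "mats": "mat"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words, map to lemmas."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


# Bag of Words: token order is discarded, only lemma counts remain.
bag = Counter(preprocess("The cats sat on the mat, and a cat is sitting on mats!"))
```

Here "cats"/"cat", "sat"/"sitting", and "mat"/"mats" each collapse into one count, which is the merging effect wanted for Bag of Words and unwanted for context-sensitive models.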
Have you thought-How Computer Interacts with the Humans?
NLP stands for Natural Language Processing. It is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language. NLP is the ability of a computer to understand, analyze, manipulate, and potentially generate human language. Just 21% of the available data is present in organized form in the 21st century. Millions of tweets, emails, and web searches are generated daily, resulting in a huge amount of data that increases by the minute. Most of this data is unstructured text. Natural Language Processing plays an important role in structuring data. Sentiment analysis is the interpretation and classification of emotions as positive, negative, or neutral within text data using text analysis techniques.
Introduction to NLP Techniques
Data Scientists work with tons of data, and many times that data includes natural languages like text and speech. That text is usually quite similar to the natural language that we use in our day-to-day life. In this blog, we are going to see some common NLP techniques, with the help of which we can begin performing analysis and building models from textual data. So, let's start with a formal definition… There are various use cases of NLP in our day-to-day life. Computers are great at working with structured data like spreadsheets and database tables, but the problem is we humans usually communicate in words, not in tables.
Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding
Tan, Samson, Joty, Shafiq, Varshney, Lav R., Kan, Min-Yen
Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.
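The encoding idea described in this abstract, reducing an inflected word to its base form and then reinjecting the grammatical information as a special symbol, can be sketched as follows. The lookup table, the bracketed symbol format, and the use of Penn Treebank tags (VBD, VBZ, NNS) as inflection symbols are assumptions for illustration and are not taken from the paper; `bite_encode` is a hypothetical name.

```python
# Hypothetical lookup: inflected form -> (base form, inflection tag).
INFLECTIONS = {
    "walked": ("walk", "VBD"),
    "walks": ("walk", "VBZ"),
    "cats": ("cat", "NNS"),
}


def bite_encode(tokens):
    """Replace each known inflected token with its base form plus a tag symbol."""
    encoded = []
    for tok in tokens:
        if tok in INFLECTIONS:
            base, tag = INFLECTIONS[tok]
            encoded.extend([base, f"[{tag}]"])
        else:
            encoded.append(tok)
    return encoded


encoded = bite_encode(["she", "walked", "two", "cats"])
```

Because "walk" is shared across "walked" and "walks", a subword tokenizer trained on the encoded text needs fewer entries to cover the same surface forms.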
Ham Among the Spam
With the growth of advertising and cold messaging, we now receive a nonstop stream of commercial messages and emails. A user, like you or me, sometimes finds it difficult to find the text or email that is actually useful or the one being sought. Detection systems such as spam detection systems are becoming increasingly useful for separating important data from the bundle of raw and undesired data. In this post we'll look at one such detection model, a spam detection model using NLP (natural language processing), and also learn about classification using Naïve Bayes. You can see that we are interested in calculating the posterior probability P(h|d) from the prior probability P(h), together with P(d) and P(d|h). UCI has an available data set of more than 5000 mixed text messages.
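Bayes' theorem ties the quantities mentioned above together as P(h|d) = P(d|h) P(h) / P(d), where h is the hypothesis (for example, "this message is spam") and d is the observed data. A minimal sketch of that calculation follows; the probabilities are invented for illustration and do not come from the UCI data set.

```python
def posterior(prior_h, likelihood_d_given_h, evidence_d):
    """Bayes' theorem: P(h|d) = P(d|h) * P(h) / P(d)."""
    return likelihood_d_given_h * prior_h / evidence_d


# Invented numbers: 20% of messages are spam; a given word appears in
# 50% of spam messages and in 5% of ham messages.
p_spam = 0.2
p_word_given_spam = 0.5
p_word_given_ham = 0.05

# P(d) by the law of total probability over the two hypotheses.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
```

Naïve Bayes extends this to many words by assuming they are conditionally independent given the class, so the per-word likelihoods are simply multiplied.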
Text Mining in Python: Steps and Examples
According to industry estimates, only 20 percent of the data being generated today is in a structured format; the rest is produced as we speak, tweet, and send messages on WhatsApp, email, Facebook, Instagram, or by text. The majority of this data exists in textual form, which is highly unstructured. To produce meaningful insights from text data, we need a method called text analysis. Text mining is the process of deriving meaningful information from natural language text. Natural Language Processing (NLP) is a part of computer science and artificial intelligence which deals with human languages. In other words, NLP is a component of text mining that performs a special kind of linguistic analysis that essentially helps a machine "read" text.
A Beginners Guide to Natural Language Processing – Towards Data Science – Medium
As I have begun my journey as a data scientist, one of the most captivating fields I have encountered is the one that seeks to understand the meaning and influence of words: Natural Language Processing (NLP). One of the greatest aspects of NLP is that it stretches across multiple areas of computational study, from artificial intelligence to computational linguistics, all studying the interactions between computers and human language. It is primarily concerned with programming computers to accurately and quickly process large amounts of natural language corpora. What are natural language corpora? It is the study of language as expressed in real-world usage.