AITopics

2510.17001

Country:

Europe (1.00)
Asia (0.93)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

arXiv.org Artificial Intelligence Jun-2-2024

XSTEM: An exemplar-based stemming algorithm

Baker, Kirk

Stemming is the process of reducing related words to a standard form by removing affixes from them. For example, eating, eats, and eaten can be reduced to the standard form eat by removing the -ing, -s, and -en from each. Stemming is a foundational step in many text processing pipelines, including information retrieval and language modeling [2], and as a feature reduction step for classification or lexical transfer learning tasks [1]. Most stemming algorithms are either rule-based or corpus-based. Rule-based stemmers use a set of manually created rules to transform words to their base form, typically in conjunction with reference to a dictionary for handling exceptions or modulating their output in some fashion. Corpus-based stemmers typically employ statistical machine learning methods to equate word forms based on distributional regularities derived from large amounts of text. Although researchers have demonstrated impressive results with corpus-based approaches (e.g., [2, and references therein]), rule-based stemming implementa…

I was introduced to exemplar theory by Keith Johnson in a phonetics seminar at Ohio State. His model is called XMOD, which, as he notes, is the best name [7]. The name XSTEM comes from that.
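The affix-removal example in this abstract (eating, eats, and eaten reduced to eat) can be illustrated with a minimal rule-based sketch. This is not the XSTEM algorithm, which is exemplar-based; the suffix list, the `min_stem` threshold, and the function name are assumptions chosen only for illustration.

```python
# Toy rule-based suffix stripper. NOT the XSTEM algorithm:
# the suffix list and minimum stem length are hypothetical.
SUFFIXES = ("ing", "en", "s")


def strip_suffix(word, suffixes=SUFFIXES, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem letters."""
    # Try longer suffixes first so "-ing" is preferred over a shorter match.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    # No rule applied: return the word unchanged.
    return word


if __name__ == "__main__":
    for w in ("eating", "eats", "eaten", "sing"):
        print(w, "->", strip_suffix(w))
```

The `min_stem` guard is one simple way to avoid over-stemming short words such as sing; real rule-based stemmers typically rely on richer conditions and exception dictionaries, as the abstract notes.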

algorithm, suffix, xstem, (14 more...)

2205.04355

Country:

North America > United States > Ohio (0.24)
North America > United States > New York > New York County > New York City (0.04)
Asia > Maldives (0.04)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area (0.69)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.77)

Kalugin-Balashov, Dmitriy

Advancing Full-Text Search Lemmatization Techniques with Paradigm Retrieval from OpenCorpora

arXiv.org Artificial Intelligence May-18-2023

In full-text search applications, the primary goal is to effectively retrieve and match relevant documents based on user queries. By focusing on finding the first form, or the lemma, of a word, the search process can be streamlined and optimized. The lemma serves as a normalized representation of a word's different inflected forms, allowing for a more accurate comparison between user queries and document content. This approach reduces the complexity and computational overhead associated with full morphological analysis, which includes extracting all possible forms of a word along with their grammatical properties. By prioritizing lemma retrieval, full-text search engines can achieve faster response times and more precise results, while minimizing the resources required for processing large volumes of text data. Consequently, building upon the foundation of pymorphy[1], the golemma library was developed to address the challenge of efficiently identifying the first form, or lemma, of words in the Russian language.
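The idea of matching queries and documents through lemmas can be sketched as follows. This is a minimal illustration and does not use the golemma or pymorphy APIs; the `LEMMA_TABLE` entries and both function names are hypothetical, and a real system would derive the table from a dictionary such as OpenCorpora.

```python
# Hypothetical form-to-lemma table for a few Russian word forms.
# A real lemmatizer would be backed by a full dictionary, not a hand-written dict.
LEMMA_TABLE = {
    "кошки": "кошка",
    "кошку": "кошка",
    "кошка": "кошка",
}


def lemmatize(token):
    """Map a token to its lemma, falling back to the lowercased token."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


def matches(query, document):
    """Return True if every query lemma also occurs among the document lemmas."""
    doc_lemmas = {lemmatize(t) for t in document.split()}
    return all(lemmatize(t) in doc_lemmas for t in query.split())


if __name__ == "__main__":
    print(matches("кошку", "Я вижу кошки"))
    print(matches("собака", "Я вижу кошки"))
```

Because both sides are normalized to the lemma, the query form and the document form do not need to agree in case or number, which is the behavior the abstract describes.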

information retrieval, natural language, paradigm, (14 more...)

2305.10848

Genre: Research Report (0.40)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.35)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.30)

#artificialintelligence Jan-15-2023, 10:55:11 GMT

Text Classification with Machine Learning vs Deep Learning

The preprocessing part of the pipeline is a very important step, as it can greatly impact the model's performance. Depending on which model will be used, the original text may need to be modified so that it has the most appropriate format to feed the model. When using Bag of Words, we want all similar words (e.g. …). To do this, we will extract the lemma of every token in the text and remove all stop words and every symbol that won't contribute to the model, which translates into lemmatization and cleaning of the text. On the other hand, if the context of the text is what we aim to focus on, then different words should not be merged into a single base form.
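A minimal sketch of the Bag of Words preprocessing described here (lemmatize each token, drop stop words and symbols, then count) might look like the following. The stop-word set and the lemma lookup are tiny hypothetical stand-ins; the article does not say which library it uses.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "a", "is", "are", "and"}
LEMMAS = {"cats": "cat", "running": "run", "ran": "run"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, drop stop words, map to lemmas."""
    tokens = re.findall(r"[a-z]+", text.lower())  # discards digits and symbols
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def bag_of_words(text):
    """Count lemma occurrences; word order is intentionally discarded."""
    return Counter(preprocess(text))


if __name__ == "__main__":
    print(preprocess("The cats are running!"))
    print(bag_of_words("Cats ran, and the cat is running."))
```

For context-sensitive models, the lemma mapping step would be skipped so that distinct word forms stay distinct, as the last sentence of the excerpt suggests.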

base form, deep learning, text classification, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.40)

#artificialintelligence Oct-30-2021, 04:30:12 GMT

Have you thought-How Computer Interacts with the Humans?

NLP stands for Natural Language Processing. It is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language. NLP is the ability of a computer to understand, analyze, manipulate, and potentially generate human language. In the 21st century, just 21% of the available data is in organized form. Millions of tweets, emails, and web searches are generated daily, resulting in a huge amount of data that grows by the minute. Most of this data is text and unstructured. Natural Language Processing plays an important role in structuring data. Sentiment analysis is the interpretation and classification of emotions as positive, negative, or neutral within text data using text analysis techniques.

frequency, stop word, tokenization, (12 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.96)

#artificialintelligence May-7-2021, 17:55:36 GMT

Introduction to NLP Techniques

Data Scientists work with tons of data, and many times that data includes natural languages like text and speech. That text is usually quite similar to the natural language that we use in our day-to-day life. In this blog, we are going to see some common NLP techniques, with the help of which we can begin performing analysis and building models from textual data. So, let's start with a formal definition… There are various use cases of NLP in our day-to-day life. Computers are great at working with structured data like spreadsheets and database tables, but the problem is we humans usually communicate in words, not in tables.

library, natural language, text data, (14 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.51)

arXiv.org Artificial Intelligence Nov-18-2020

Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding

Tan, Samson, Joty, Shafiq, Varshney, Lav R., Kan, Min-Yen

Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.
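The encoding idea in this abstract (reduce an inflected word to its base form, then reinject the grammatical information as a special symbol) can be sketched as below. This is not the authors' implementation: the `INFLECTIONS` lookup table, the bracketed tag format, and the function name are assumptions made for illustration, standing in for whatever tagger and lemmatizer the actual method uses.

```python
# Hypothetical lookup from inflected form to (base form, inflection tag).
# The tags follow Penn Treebank naming only as an illustrative convention.
INFLECTIONS = {
    "ate": ("eat", "VBD"),
    "eats": ("eat", "VBZ"),
    "eating": ("eat", "VBG"),
    "dogs": ("dog", "NNS"),
}


def base_inflection_encode(tokens):
    """Replace each known inflected token with its base form plus a tag symbol."""
    encoded = []
    for token in tokens:
        if token in INFLECTIONS:
            base, tag = INFLECTIONS[token]
            encoded.extend([base, f"[{tag}]"])  # base form, then inflection symbol
        else:
            encoded.append(token)  # unknown or uninflected tokens pass through
    return encoded


if __name__ == "__main__":
    print(base_inflection_encode(["she", "eats", "apples"]))
```

Separating the base form from the inflection symbol means that differently inflected variants of a word share the same base token, which is one plausible reading of why such an encoding could help with non-standard inflections.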

base form, computational linguistic, proceedings, (14 more...)

2004.1487

Country:

Asia > Singapore (0.25)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
(20 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

#artificialintelligence Oct-7-2020, 10:41:15 GMT

Ham Among the Spam

With the growth in advertisements and cold messaging, we now receive nonstop threads of commercial messages and emails. A user, like you or me, sometimes finds it difficult to locate the text or email that is actually useful or the one being sought. Detection systems such as spam detection systems are becoming increasingly useful for separating the important data from the bundle of raw and undesired data. In this post we'll look at one such detection model, a spam detection model using NLP (natural language processing), and also learn about classification using Naïve Bayes. You can see that we are interested in calculating the posterior probability P(h|d) from the prior probability P(h) together with P(d) and P(d|h). UCI has an available data set of more than 5000 mixed text messages.
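The posterior calculation mentioned in this excerpt follows Bayes' theorem, P(h | d) = P(d | h) P(h) / P(d). A minimal sketch with two hypotheses (spam and ham) is shown here; the probabilities in the usage example are invented for illustration and do not come from the UCI data set.

```python
def posterior(prior_h, likelihood_d_given_h, evidence_d):
    """Bayes' theorem: P(h | d) = P(d | h) * P(h) / P(d)."""
    if evidence_d <= 0:
        raise ValueError("P(d) must be positive")
    return likelihood_d_given_h * prior_h / evidence_d


def spam_posterior(p_spam, p_word_given_spam, p_word_given_ham):
    """Posterior that a message is spam given that it contains one word."""
    p_ham = 1.0 - p_spam
    # Law of total probability over the two hypotheses gives P(d).
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return posterior(p_spam, p_word_given_spam, evidence)


if __name__ == "__main__":
    # Invented numbers: 20% of messages are spam; the word appears in
    # 50% of spam messages and 5% of ham messages.
    print(spam_posterior(0.2, 0.5, 0.05))
```

A full Naïve Bayes classifier would multiply such per-word likelihoods under the assumption that words are conditionally independent given the class; this sketch covers only the single-word case.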

frequency, machine learning, natural language, (12 more...)

Industry: Telecommunications (0.36)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.94)

#artificialintelligence Aug-22-2019, 18:54:25 GMT

Text Mining in Python: Steps and Examples

According to industry estimates, only 20 percent of the data generated today is in a structured format; the rest is produced as we speak, tweet, and send messages on WhatsApp, email, Facebook, Instagram, or by text. The majority of this data exists in textual form, which is highly unstructured. To produce meaningful insights from this text data, we need a method called Text Analysis. Text Mining is the process of deriving meaningful information from natural language text. Natural Language Processing (NLP) is a part of computer science and artificial intelligence that deals with human languages. In other words, NLP is a component of text mining that performs a special kind of linguistic analysis that essentially helps a machine "read" text.

artificial intelligence, data mining, natural language, (11 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Data Science > Data Mining > Text Mining (0.85)

#artificialintelligence Sep-5-2017, 13:56:17 GMT

A Beginners Guide to Natural Language Processing – Towards Data Science – Medium

As I have begun my journey as a data scientist, one of the most captivating fields is that which seeks to understand the meaning and influence of words: Natural Language Processing (NLP). One of the greatest aspects of NLP is that it stretches across multiple areas of computational studies, from artificial intelligence to computational linguistics, all studying the interactions between computers and the language of humans. It is primarily concerned with programming computers to accurately and quickly process large amounts of natural language corpora. What are natural language corpora? It is the study of language as expressed by real world languages.

beginner guide, lemmatization, nlp, (8 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)