Text processing


CJKV Information Processing, 2nd Edition - Programmer Books

#artificialintelligence

First published a decade ago, CJKV Information Processing quickly became the unsurpassed source of information on processing text in Chinese, Japanese, Korean, and Vietnamese. It has now been thoroughly updated to provide web and application developers with the latest techniques and tools for disseminating information directly to audiences in East Asia. This second edition reflects the considerable impact that Unicode, XML, OpenType, and newer operating systems such as Windows XP, Vista, Mac OS X, and Linux have had on East Asian text processing in recent years.


Text Mining in Python: Steps and Examples

#artificialintelligence

According to industry estimates, only about 20 percent of the data being generated today is in a structured format; the rest is produced as we speak, tweet, and send messages over WhatsApp, email, Facebook, Instagram, or SMS. The majority of this data exists in textual form, which is highly unstructured, and to produce meaningful insights from it we need a method called text analysis. Text mining is the process of deriving meaningful information from natural language text. Natural Language Processing (NLP) is a part of computer science and artificial intelligence that deals with human languages. In other words, NLP is a component of text mining that performs a special kind of linguistic analysis, essentially helping a machine "read" text.
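The article walks through the standard preprocessing steps. Below is a minimal illustrative sketch of such a pipeline with NLTK (not the article's own code); it assumes NLTK is installed and the listed resources can be downloaded, and the sample sentence is invented.

```python
# Minimal text-mining preprocessing sketch with NLTK (illustrative only).
import nltk
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "Text mining derives meaningful information from natural language text."

# 1. Tokenize into words.
tokens = word_tokenize(text.lower())

# 2. Remove stopwords and punctuation.
stops = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stops]

# 3. Lemmatize to base forms.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in content]

# 4. Part-of-speech tagging on the cleaned tokens.
print(nltk.pos_tag(lemmas))
```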


Baidu Open-Sources ERNIE 2.0, Beats BERT in Natural Language Processing Tasks

#artificialintelligence

In a recent blog post, Baidu, the Chinese search engine and e-commerce giant, announced its latest open-source natural language understanding framework, ERNIE 2.0. The company also shared recent test results, including achieving state-of-the-art (SOTA) results and outperforming existing frameworks, including Google's BERT and XLNet, on 16 NLP tasks in both Chinese and English. ERNIE 2.0, more formally known as Enhanced Representation through kNowledge IntEgration, is a continual pre-training framework for language understanding in which pre-training tasks can be incrementally built and learned through constant multi-task learning. In this framework, different customized tasks can be introduced incrementally at any time and are trained through multi-task learning, which permits the encoding of lexical, syntactic, and semantic information across tasks.
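Baidu's implementation is built on large transformer encoders in PaddlePaddle; the following is only a toy PyTorch sketch of the general idea described above: a shared encoder, task heads that can be added at any time, and joint multi-task updates once a task is introduced. All class names, task names, and dimensions are invented for illustration.

```python
# Toy sketch of continual multi-task pre-training (not Baidu's code).
import torch
import torch.nn as nn

class ContinualMultiTaskModel(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(input_size=32, hidden_size=hidden, batch_first=True)
        self.heads = nn.ModuleDict()  # one prediction head per pre-training task

    def add_task(self, name, num_labels, hidden=128):
        # New tasks can be introduced later without touching existing heads.
        self.heads[name] = nn.Linear(hidden, num_labels)

    def forward(self, x, task):
        _, h = self.encoder(x)          # shared representation
        return self.heads[task](h[-1])  # task-specific prediction

model = ContinualMultiTaskModel()
model.add_task("sentence_order", num_labels=2)   # introduced first
model.add_task("entity_masking", num_labels=10)  # introduced later

loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters())
x = torch.randn(4, 16, 32)                       # toy batch of 4 sequences
for task, labels in [("sentence_order", torch.randint(0, 2, (4,))),
                     ("entity_masking", torch.randint(0, 10, (4,)))]:
    loss = loss_fn(model(x, task), labels)       # multi-task: cycle over tasks
    opt.zero_grad()
    loss.backward()
    opt.step()
```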


Baidu Research

#artificialintelligence

However, besides co-occurrence, there is other valuable lexical, syntactic and semantic information in training corpora. For example, named entities, such as names, locations and organizations, could contain conceptual information. Sentence order and proximity between sentences would allow models to learn structure-aware representations. What's more, semantic similarity at the document level or discourse relations among sentences could train the models to learn semantic-aware representations. Hypothetically speaking, would it be possible to further improve performance if the model was trained to constantly learn a larger variety of tasks?


Natural language processing in Apache Spark using NLTK (part 1/2)

#artificialintelligence

In the very basic form, Natural language processing is a field of Artificial Intelligence that explores computational methods for interpreting and processing natural language, in either textual or spoken form. In this series of 2 blogs I'll be discussing Natural Language Processing, NLTK in Spark, environment setup and some basic implementations in the first one, and how we can create an NLP application which is leveraging the benefits of Bigdata in the second. A Natural language or Ordinary language is any language that has evolved naturally with time in humans through use and repetition without conscious planning or premeditation. Natural languages can take different forms, such as speech, signing or text. Signs, Menus, Email, SMS, Web Page and so much more… The list is endless.
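As a rough preview of what combining the two tools looks like, here is a minimal sketch (not the blog's exact code) of running NLTK tokenization inside an Apache Spark job. It assumes pyspark and nltk are installed and that NLTK's 'punkt' tokenizer data is available on every worker; the sample sentences are invented.

```python
# Tokenize lines of text in parallel with NLTK inside a Spark RDD transformation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nltk-demo").getOrCreate()
lines = spark.sparkContext.parallelize([
    "Natural language processing explores computational methods for language.",
    "Spark distributes the tokenization work across the cluster.",
])

def tokenize_partition(rows):
    # Import inside the function so each executor loads NLTK locally.
    from nltk.tokenize import word_tokenize
    for row in rows:
        yield word_tokenize(row.lower())

tokens = lines.mapPartitions(tokenize_partition)
print(tokens.collect())
spark.stop()
```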


Building a Massive Corpus for Named Entity Recognition using Free Open Data Sources

arXiv.org Artificial Intelligence

With the recent progress in machine learning, boosted by techniques such as deep learning, many tasks can be successfully solved once a large enough dataset is available for training. Nonetheless, human-annotated datasets are often expensive to produce, especially when labels are fine-grained, as is the case for Named Entity Recognition (NER), a task that operates with labels at the word level. In this paper, we propose a method to automatically generate labeled datasets for NER from public data sources by exploiting links and structured data from DBpedia and Wikipedia. Due to the massive size of these data sources, the resulting dataset, SESAME, is composed of millions of labeled sentences. We detail the method used to generate the dataset, report relevant statistics, and design a baseline using a neural network, showing that our dataset helps build better NER predictors.
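To make the core idea concrete, here is an illustrative sketch (not the paper's pipeline) of how anchor links in Wikipedia-style markup, combined with a DBpedia-like entity-type lookup, can be turned into token-level NER labels. The regex, the toy lookup table, and the BIO tagging scheme are assumptions for demonstration.

```python
# Derive word-level NER labels from hyperlinked text (illustrative sketch).
import re

ENTITY_TYPE = {"Rio_de_Janeiro": "LOC", "PUC-Rio": "ORG"}  # toy DBpedia-like map

def label_sentence(wikitext):
    tokens, labels = [], []
    # Match anchors like [[Target|surface text]] or plain whitespace-separated tokens.
    for anchor, surface, plain in re.findall(
            r"\[\[([^|\]]+)\|([^\]]+)\]\]|(\S+)", wikitext):
        if plain:
            tokens.append(plain)
            labels.append("O")
        else:
            etype = ENTITY_TYPE.get(anchor, "O")
            for i, tok in enumerate(surface.split()):
                tokens.append(tok)
                if etype == "O":
                    labels.append("O")
                else:
                    labels.append(("B-" if i == 0 else "I-") + etype)
    return list(zip(tokens, labels))

print(label_sentence(
    "The campus of [[PUC-Rio|PUC-Rio]] is in [[Rio_de_Janeiro|Rio de Janeiro]] ."))
```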


SHREWD: Semantic Hierarchy-based Relational Embeddings for Weakly-supervised Deep Hashing

arXiv.org Artificial Intelligence

Using class labels to represent class similarity is a typical approach to training deep hashing systems for retrieval; samples from the same or different classes take binary 1 or 0 similarity values. This similarity does not model the full rich knowledge of semantic relations that may be present between data points. In this work we build upon the idea of using semantic hierarchies to form distance metrics between all available sample labels; for example, cat-to-dog has a smaller distance than cat-to-guitar. We combine this type of semantic distance into a loss function to promote similar distances between the deep neural network embeddings. We also introduce an empirical Kullback-Leibler divergence loss term to promote binarization and uniformity of the embeddings. We test the resulting SHREWD method and demonstrate improvements in hierarchical retrieval scores using compact, binary hash codes instead of real-valued ones, and show that in a weakly supervised hashing setting we are able to learn competitively without explicitly relying on class labels, but instead on similarities between labels.
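A toy sketch of the two ingredients named in the abstract follows (this is not the authors' implementation): a term that makes pairwise embedding distances track hierarchy-derived label distances, plus an empirical KL-divergence term pushing embedding values toward binary codes. The specific losses, weights, and the half-and-half target histogram are assumptions for illustration.

```python
import torch

def shrewd_like_loss(embeddings, label_dist, kl_weight=0.1, n_bins=10):
    # (1) Match pairwise embedding distances to semantic label distances.
    emb_dist = torch.cdist(embeddings, embeddings)           # pairwise L2
    distance_loss = torch.mean((emb_dist - label_dist) ** 2)

    # (2) Empirical KL between the histogram of tanh-squashed embedding values
    #     and a target concentrated near -1 and +1 (encourages binarization).
    #     Note: torch.histc is not differentiable; this only shows the shape
    #     of the objective, not a trainable estimator.
    vals = torch.tanh(embeddings).flatten()
    hist = torch.histc(vals, bins=n_bins, min=-1.0, max=1.0) + 1e-6
    hist = hist / hist.sum()
    target = torch.zeros(n_bins) + 1e-6
    target[0], target[-1] = 0.5, 0.5
    target = target / target.sum()
    kl = torch.sum(hist * torch.log(hist / target))
    return distance_loss + kl_weight * kl

emb = torch.randn(4, 16)                        # 4 samples, 16-dim embeddings
sem = torch.tensor([[0., 1., 3., 3.],           # toy hierarchy distances:
                    [1., 0., 3., 3.],           # cat-dog close, cat-guitar far
                    [3., 3., 0., 1.],
                    [3., 3., 1., 0.]])
print(shrewd_like_loss(emb, sem))
```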


Lexical semantics - Wikipedia

#artificialintelligence

Lexical semantics (also known as lexicosemantics) is a subfield of linguistic semantics. The units of analysis in lexical semantics are lexical units, which include not only words but also sub-words or sub-units such as affixes, and even compound words and phrases. Lexical units make up the catalogue of words in a language, the lexicon. Lexical semantics looks at how the meaning of the lexical units correlates with the structure of the language or syntax. This is referred to as the syntax-semantics interface.[1] Lexical units, also referred to as syntactic atoms, can stand alone, as in the case of root words or parts of compound words, or they necessarily attach to other units, as prefixes and suffixes do. The former are called free morphemes and the latter bound morphemes.[2]


Word Embedding for Response-To-Text Assessment of Evidence

arXiv.org Artificial Intelligence

Manually grading the Response to Text Assessment (RTA) is labor-intensive. Therefore, an automatic method is being developed for scoring analytical writing when the RTA is administered in large numbers of classrooms. Our long-term goal is to also use this scoring method to provide formative feedback to students and teachers about students' writing quality. As a first step towards this goal, interpretable features for automatically scoring the evidence rubric of the RTA have been developed. In this paper, we present a simple but promising method for improving evidence scoring by employing a word embedding model. We evaluate our method on corpora of responses written by upper elementary students.
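To give a sense of how word embeddings can feed an evidence feature, here is a rough sketch (not the paper's actual feature set): count response words whose embedding is close enough to some word from the source article's evidence list. The toy 3-dimensional vectors stand in for real pretrained word2vec/GloVe embeddings, and the 0.8 threshold is an arbitrary assumption.

```python
import numpy as np

vectors = {                      # in practice: pretrained word embeddings
    "pollution": np.array([0.9, 0.1, 0.0]),
    "smog":      np.array([0.8, 0.2, 0.1]),
    "guitar":    np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def evidence_score(response_words, evidence_words, threshold=0.8):
    # A response word "matches" if it is similar enough to any evidence word.
    hits = 0
    for w in response_words:
        if w in vectors and any(
            e in vectors and cosine(vectors[w], vectors[e]) >= threshold
            for e in evidence_words
        ):
            hits += 1
    return hits

print(evidence_score(["smog", "guitar"], ["pollution"]))   # -> 1
```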


Unsupervised Context Retrieval for Long-tail Entities

arXiv.org Artificial Intelligence

Monitoring entities in media streams often relies on rich entity representations, like structured information available in a knowledge base (KB). For long-tail entities, such monitoring is highly challenging, due to their limited, if not entirely missing, representation in the reference KB. In this paper, we address the problem of retrieving textual contexts for monitoring long-tail entities. We propose an unsupervised method to overcome the limited representation of long-tail entities by leveraging established entities and their contexts as support information. Evaluation on a purpose-built test collection shows the suitability of our approach and its robustness for out-of-KB entities.