Text Processing


Going deeper with recurrent networks: Sequence to Bag of Words Model

@machinelearnbot

Now, with deep learning, we can convert unstructured text into computable formats, effectively incorporating semantic knowledge for training machine learning models. A recurrent neural network (RNN) is a network containing neural layers with a temporal feedback loop. Running on an NVIDIA GPU gave us the computational power to blaze through 10 million job descriptions in 15 minutes (a 32-unit RNN with 24-dimensional pre-trained interest word vectors). We use deep learning to compute semantic embeddings for keywords and titles.
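
A minimal sketch of the kind of model the excerpt describes: a small RNN running over frozen pre-trained word vectors. Only the 32-unit RNN width and 24-dimensional embeddings come from the excerpt; the vocabulary size, sequence length, random placeholder vectors, and binary classification head are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE, SEQ_LEN, EMBED_DIM, RNN_UNITS = 10_000, 50, 24, 32

# Placeholder for pre-trained "interest" word vectors (normally loaded from disk).
pretrained = np.random.rand(VOCAB_SIZE, EMBED_DIM).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained),
        trainable=False,  # keep the pre-trained vectors frozen
    ),
    tf.keras.layers.SimpleRNN(RNN_UNITS),        # temporal feedback loop over the token sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. one label per job description
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Forward pass on dummy token IDs to confirm the shapes line up.
dummy = np.random.randint(0, VOCAB_SIZE, size=(4, SEQ_LEN))
print(model(dummy).shape)  # (4, 1)
```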


Do-it-yourself NLP for bot developers – Rasa Blog – Medium

#artificialintelligence

In a previous post I mentioned that tools like wit and LUIS make intent classification and entity extraction so simple that you can build a bot like this during a hackathon. Amazingly, this is enough to generalise correctly and to pick out "indian" as a cuisine type, based on its similarity to the reference words. That is why I say that once you have a good change of variables, problems become easy. There are many ways we could combine word vectors to represent a sentence, but again we're going to do the simplest thing possible: add them up.
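
As a concrete sketch of that "add them up" step: the toy 4-dimensional vectors below are invented for illustration; real ones would come from a model such as GloVe or word2vec.

```python
import numpy as np

# Toy word vectors (made up); in practice, load pre-trained embeddings.
word_vectors = {
    "i":      np.array([0.1, 0.3, 0.0, 0.2]),
    "love":   np.array([0.7, 0.1, 0.4, 0.0]),
    "indian": np.array([0.2, 0.9, 0.5, 0.1]),
    "food":   np.array([0.3, 0.8, 0.6, 0.2]),
}

def sentence_vector(tokens):
    """Represent a sentence as the sum of its word vectors; skip unknown words."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.sum(known, axis=0) if known else np.zeros(4)

print(sentence_vector("i love indian food".split()))
```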


Bringing AI to BI – Text Analytics in Azure Machine Learning

#artificialintelligence

The core of the Bing News template starts with an Azure Logic App, which polls the Bing News API for articles on a preset schedule (every 5 minutes) for a list of user-specified topics. As the data makes its way through the Logic App, the actual news article text is retrieved and sent through a series of Azure Functions for basic data transformation. These text enrichments could alternatively be performed in the Azure ML portion of the pipeline using the "Extract Key Phrases from Text" module. We then use a separate, periodically invoked Logic App to call several Azure ML web services, which perform the more complex tasks of Vowpal Wabbit topic clustering and named entity recognition (NER).
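
A rough sketch of what one of those web-service calls could look like if reproduced in Python rather than a Logic App. The endpoint URL, API key, and request schema below are hypothetical placeholders, not the template's actual values.

```python
import requests

ENDPOINT = "https://example.azureml.net/workspaces/<id>/services/<id>/execute"  # hypothetical
API_KEY = "<your-api-key>"  # hypothetical

# Hypothetical request body carrying one article's text for enrichment.
payload = {"Inputs": {"input1": [{"text": "Example news article body..."}]}}

resp = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. extracted entities or topic assignments
```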


Ask not what you can do for machine learning… – Inside Machine learning – Medium

#artificialintelligence

There is a great read by Ed Newton-Rex that outlines the differences between artificial intelligence and machine learning in simple terms. Even though we're still some way from building our one true AI overlord, a.k.a. Skynet, over the past two decades we have seen significant advances in the field of machine learning (ML). Knowing whether a photo shows a mountain or a skyscraper no longer requires a human to tag the photo, and neither does identifying relations and named entities in unstructured text. This is having a profound impact on how the next generation of software products is defined and, as a result, on why the next generation of software product managers needs to think about what ML can bring to the table.


Doing text analytics for Digital Humanities and Social Sciences with CLARIN (LDK tutorial), Galway 2017

VideoLectures.NET

Text is a basic material, a primary data layer, in many areas of humanities and social sciences. If we want to move forward with the agenda that the fields of digital humanities and computational social sciences are projecting, it is vital to bring together the technical areas that deal with automated text processing, and scholars in the humanities and social sciences. To foster new areas of research, it is necessary to not only understand what is out there in terms of proven technologies and infrastructures such as CLARIN, but also how the developers of text analytics can work with researchers in the humanities and social sciences to understand the challenges in each other's field better. What are the research questions of the researchers working on the texts?



Language Models, Word2Vec, and Efficient Softmax Approximations

@machinelearnbot

The difference is that the skip-gram model predicts the context (surrounding) words given the current word, whereas the continuous bag-of-words (CBOW) model predicts the current word from several surrounding words. For example, given the sentence "The quick brown fox jumped over the lazy dog" and a window size of 2, the skip-gram model would produce training pairs such as (quick, the), (quick, brown), and (quick, fox). In contrast, the CBOW model takes the context words within the window (such as "the", "brown", "fox") as input and aims to predict the target word "quick", simply reversing the skip-gram model's input-to-prediction pipeline. As discussed, the traditional softmax can become prohibitively expensive on large corpora; hierarchical softmax is a common alternative that approximates the softmax computation with logarithmic rather than linear time complexity in the vocabulary size. Another alternative, negative sampling, sidesteps the full softmax entirely: we instead learn word vectors by learning to distinguish true (target, context) pairs from corrupted (target, random word from the vocabulary) pairs.
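
As a quick illustration (not from the article), here is a minimal Python sketch that generates the skip-gram (target, context) training pairs for that sentence and window size:

```python
sentence = "The quick brown fox jumped over the lazy dog".lower().split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    # context = up to `window` words on each side of the target
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:6])
# [('the', 'quick'), ('the', 'brown'),
#  ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox'), ('brown', 'the')]
```

Reversing each pair (context as input, target as output) gives the corresponding CBOW training examples.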


30 Questions to test a data scientist on Natural Language Processing [Solution: Skilltest – NLP] - Analytics Vidhya

#artificialintelligence

"Analytics Vidhya is a great source to learn data science" "#Analytics-vidhya is a great source to learn @data_science." After performing stopword removal and punctuation replacement the text becomes: "Analytics vidhya great source learn data science" "The next meetup on data science will be held on 2017-09-21, previously it happened on 31/03, 2016" None if these expressions would be able to identify the dates in this text object. Choices A and B are correct because stopword removal will decrease the number of features in the matrix, normalization of words will also reduce redundant features, and, converting all words to lowercase will also decrease the dimensionality. Selection of the number of topics is directly proportional to the size of the data, while number of topic terms is not directly proportional to the size of the data.


A Practical Guide to Artificial Intelligence

#artificialintelligence

Details vary, but most AI systems today are "trained" by giving them examples of inputs and outputs (i.e., correct decisions), and letting the system generate its own internal rules to predict the output from the input. It also can't automatically associate similar terms, find relationships between terms, understand context, or measure sentiment. There are also many different types of recommendations and ways to make them, including recommendations for similar products, complementary products, most popular products, or best values; recommendations based on the individual, segments, or the entire customer base; recommendations choosing from a few options or a huge catalog; and recommendations in response to a search request. AI-based systems often combine these capabilities, simultaneously finding customer segments and creating new site versions tailored to these segments.
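
A toy illustration of the training idea the article describes: give the system example inputs and correct outputs, and let it induce its own rules for predicting new cases. The tiny dataset below is invented for the sketch.

```python
from sklearn.tree import DecisionTreeClassifier

# Example inputs: [price, pages_viewed]; outputs: 1 = purchased, 0 = did not.
X = [[10, 3], [200, 1], [15, 8], [300, 2], [12, 6], [250, 1]]
y = [1, 0, 1, 0, 1, 0]

model = DecisionTreeClassifier().fit(X, y)  # "generate its own internal rules"
print(model.predict([[20, 5]]))             # predict the output for a new input
```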


Who Made the News? Text Analysis using R, in 7 steps

@machinelearnbot

The dataset used for the analysis was obtained from Kaggle Datasets and is attributed to the UCI Machine Learning Repository. Clean and pre-process the text by removing punctuation and "stop words" (a, the, and, …) using the tm_map() function, then create wordclouds for the publisher "Reuters". The "color" option allows us to specify the color palette, and "rot.per" sets the proportion of words drawn at a 90-degree rotation. We create wordclouds for 2 more publishers (Celebrity Café & CBS_Local) in the same way, as shown below.
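
The article's own code is in R (the tm and wordcloud packages). As a rough Python analogue of the same cleaning-plus-wordcloud step, here `prefer_horizontal` plays the role of R's "rot.per" (it controls the fraction of words drawn horizontally rather than rotated 90 degrees), and the headline text is a placeholder.

```python
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

headlines = "Reuters reports markets rally as tech stocks surge ..."  # placeholder text

wc = WordCloud(
    stopwords=STOPWORDS,    # drop "a", "the", "and", ...
    colormap="Dark2",       # analogous to the "color" palette option
    prefer_horizontal=0.7,  # ~30% of words rotated, like rot.per = 0.3
).generate(headlines.lower())

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```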