Classification, or categorization, is the task of assigning labels to various items, such as products. Classification happens effortlessly in humans' everyday lives. Imagine, for example, that you are going to the grocery store. There, you will implicitly assign labels to the products such as "healthy" versus "not healthy", "GMO" versus "non GMO", or "fresh" versus "stale".
Text classification is the task of assigning labels to text documents. Documents can be webpages, emails, advertisements, or even product reviews. Here are some examples of text classification: categorizing a web page as "English language" versus "Chinese language" versus "other language", an email as "spam" versus "not spam", or a product review as "positive" versus "negative". Since the number of existing documents is already huge and growing rapidly every day, it is impossible for humans to manually classify every document. As a result, we need techniques that can automatically assign labels to text.
In order to better understand how text classifiers work, let's think about how people classify items. Returning to the grocery store example, imagine that we want to classify an item as "healthy" versus "not healthy". To make the decision, we first identify a set of item features that are important for the classification (e.g., percentage of sugar, fat, and salt). Then, we extract those features from the actual product (e.g., 10% sugar, 1% fat, and 0.1% salt). Finally, we combine the values of the features in some way and, depending on whether the combined score exceeds a specific threshold, we classify the product as "healthy" or "not healthy".
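The decision rule just described can be sketched as a tiny scoring function. The weights and the threshold below are invented for illustration and do not reflect any real nutritional guideline:

```python
# Invented weights and threshold for illustration only; real nutritional
# guidelines are more involved than a single linear score.
def classify_product(sugar_pct, fat_pct, salt_pct):
    """Combine feature values into one score and compare it to a threshold."""
    score = 2.0 * sugar_pct + 1.5 * fat_pct + 3.0 * salt_pct
    return "not healthy" if score > 30 else "healthy"

print(classify_product(10, 1, 0.1))  # prints "healthy" (score 21.8)
```

A learned classifier works the same way, except that the weights and the threshold are estimated from labeled examples rather than set by hand.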
A spam detection classifier that labels an email as "spam" or "not spam" works in a similar way: it accepts as input a set of emails, which have already been labeled as "spam" or "not spam", and extracts features, such as the domain of the sender (e.g., the country), or the appearance of links and images in the email. Text classification also benefits from features such as the words or phrases in the email. For example, spam emails usually contain phrases such as "free i-phone" or "your credit card info is needed". The classifier uses the documents with the known labels and learns a model that, given the values of the features, can classify new incoming emails as "spam" or "not spam".
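A minimal sketch of such a learned spam classifier, assuming hand-picked binary features (suspicious words and phrases, links, sender domain) and a simple perceptron weight update; all emails, phrases, and domains below are invented:

```python
# Hand-crafted binary features for an email represented as a dict; the feature
# set and the "trusted" domain are assumptions for this sketch.
def features(email):
    text = email["body"].lower()
    return [
        1.0,                                              # bias term
        1.0 if "free" in text else 0.0,                   # suspicious word
        1.0 if "credit card" in text else 0.0,            # suspicious phrase
        1.0 if email["has_link"] else 0.0,                # contains a link
        1.0 if email["domain"] != "example.com" else 0.0, # external sender
    ]

def train(labeled, epochs=10):
    """Learn feature weights from already-labeled emails (perceptron rule)."""
    w = [0.0] * 5
    for _ in range(epochs):
        for email, label in labeled:  # label: 1 = spam, 0 = not spam
            x = features(email)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            w = [wi + (label - pred) * xi for wi, xi in zip(w, x)]
    return w

def predict(w, email):
    score = sum(wi * xi for wi, xi in zip(w, features(email)))
    return "spam" if score > 0 else "not spam"
```

Trained on a handful of labeled emails, the weights end up rewarding the features that co-occur with the "spam" label, which is exactly the learning step the paragraph describes.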
Although there is a large body of research on text classification, there is a growing need for new models that can reliably determine the labels of documents. This is due to factors such as the rise of social networks, the evolution of writing styles, and the appearance of new formats of textual information (e.g., emoji). Recent research has seen much success in automatic text classification thanks to the advent of deep learning, but the problem remains a challenging one for the Artificial Intelligence community.
- Pigi Kouki
We often see transfer learning applied to computer vision models, but what about using it for text classification? Enter TensorFlow Hub, a library for enhancing your TF models with transfer learning. Transfer learning is the process of taking the weights and variables of a pre-existing model that has already been trained on lots of data and leveraging it for your own data and prediction task. One of the many benefits of transfer learning is that you don't need to provide as much of your own training data as you would if you were starting from scratch. But where do these pre-existing models come from?
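The division of labor in transfer learning can be sketched in a few lines: a frozen, "pretrained" feature extractor supplies the representation, and only a small head is trained on our own (small) labeled set. The tiny word vectors below are invented stand-ins for a real pretrained module such as one loaded from TensorFlow Hub:

```python
# Toy stand-in for transfer learning: the "pretrained" vectors below are
# invented; in practice they would come from a module trained on lots of data.
PRETRAINED = {
    "good": [0.9, 0.1], "great": [0.8, 0.2],
    "bad": [0.1, 0.9], "awful": [0.2, 0.8],
}

def embed(text):
    """Frozen feature extractor: average pretrained vectors (the transferred part)."""
    vecs = [PRETRAINED[w] for w in text.lower().split() if w in PRETRAINED]
    if not vecs:
        return [0.0, 0.0]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]

def train_head(data, epochs=20, lr=0.5):
    """Train only a small classification head on our own labeled examples."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in data:  # label: 1 = positive, 0 = negative
            x = embed(text)
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = label - pred
            w = [w[i] + lr * err * x[i] for i in range(2)]
            b += lr * err
    return w, b

def predict_sentiment(w, b, text):
    x = embed(text)
    return "positive" if w[0] * x[0] + w[1] * x[1] + b > 0 else "negative"
```

Because the representation is reused rather than learned, two labeled examples are enough to fit the head in this toy setting, which is the point of the "less training data" benefit.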
Deep neural networks are gaining increasing popularity for the classic text classification task, due to their strong expressive power and lower requirements for feature engineering. Despite such attractiveness, neural text classification models suffer from the lack of training data in many real-world applications. Although many semi-supervised and weakly-supervised text classification models exist, they cannot be easily applied to deep neural models and support only limited types of supervision. In this paper, we propose a weakly-supervised method that addresses the lack of training data in neural text classification. Our method consists of two modules: (1) a pseudo-document generator that leverages seed information to generate pseudo-labeled documents for model pre-training, and (2) a self-training module that bootstraps on real unlabeled data for model refinement. Our method has the flexibility to handle different types of weak supervision and can be easily integrated into existing deep neural models for text classification. We have performed extensive experiments on three real-world datasets from different domains. The results demonstrate that our proposed method achieves strong performance without requiring excessive training data and outperforms baseline methods significantly.
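The two-module idea can be illustrated with a toy stand-in: seed keywords provide pseudo-labels for pre-training, and a confidence-thresholded self-training loop then bootstraps on the unlabeled data. This is a conceptual sketch, not the paper's actual neural implementation; the seed words, classes, and documents are invented:

```python
# Toy stand-in for weak supervision: seed keywords per class (invented).
SEEDS = {"sports": {"game", "team", "score"}, "tech": {"software", "code", "computer"}}

def seed_label(doc):
    """Module 1 stand-in: pseudo-label a document by seed-keyword overlap."""
    hits = {c: sum(w in kws for w in doc.split()) for c, kws in SEEDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

def fit(labeled):
    """Fit per-class word counts (stand-in for pre-training a neural model)."""
    model = {c: {} for c in SEEDS}
    for doc, c in labeled:
        for w in doc.split():
            model[c][w] = model[c].get(w, 0) + 1
    return model

def predict(model, doc):
    scores = {c: sum(freq.get(w, 0) for w in doc.split()) for c, freq in model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best] / (sum(scores.values()) or 1)

def self_train(unlabeled, rounds=2, threshold=0.8):
    """Module 2: bootstrap on unlabeled data, keeping only confident predictions."""
    seeded = [(d, seed_label(d)) for d in unlabeled if seed_label(d) is not None]
    model = fit(seeded)
    for _ in range(rounds):
        confident = []
        for d in unlabeled:
            c, conf = predict(model, d)
            if conf >= threshold:
                confident.append((d, c))
        model = fit(seeded + confident)
    return model
```

The confidence threshold is what keeps the bootstrap from drifting: only documents the current model is already sure about are folded back into the training set.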
This is the fourth in a series of articles intended to make Machine Learning more approachable to those without technical training. Prior articles introduce the concept of Machine Learning, show how the process of learning works in general, and describe commonly used algorithms. You can start the series here. In this installment of the series, we review some common measures and considerations when assessing the effectiveness and value of a classification algorithm. To get started, let's assume that an algorithm has been developed on a training set and has generalized well.
It provides current state-of-the-art accuracy and speed, and has an active open source community. However, since spaCy is a relatively new NLP library, it is not as widely adopted as NLTK, and there are not yet many tutorials available. In this post, we will demonstrate how text classification can be implemented using spaCy without any deep learning experience. It is often a time-consuming and frustrating experience for a young researcher to find and select a suitable academic conference to submit his or her papers to.
After doing the usual feature engineering and selection, and of course implementing a model and getting some output in the form of a probability or a class, the next step is to find out how effective the model is, based on some metric computed on test datasets. Different performance metrics are used to evaluate different machine learning algorithms. For now, we will focus on the ones used for classification problems. We can use classification performance metrics such as log-loss, accuracy, AUC (area under the curve), etc. Other examples are precision and recall, which are used to evaluate ranked results, such as those produced by search engines. The metrics you choose to evaluate your machine learning model are very important.
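These metrics are straightforward to compute by hand. A minimal sketch for binary labels (1 = positive class, 0 = negative class):

```python
import math

def confusion_counts(y_true, y_pred):
    """True/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0

def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0

def log_loss(y_true, probs, eps=1e-15):
    """Penalizes confident wrong probability estimates."""
    return -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
                for t, p in zip(y_true, probs)) / len(y_true)
```

Note that precision and recall need predicted labels, while log-loss needs predicted probabilities; which one applies depends on what your model outputs.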
Text classification implementation with TensorFlow can be simple. One of the areas where text classification can be applied is chatbot text processing and intent resolution. In this post I will describe, step by step, how to build a TensorFlow model for text classification and how the classification is done. Please refer to my previous post on a similar topic -- Contextual Chatbot with TensorFlow, Node.js and Oracle JET -- Steps How to Install and Get It Working. I would also recommend going through this great post about chatbot implementation -- Contextual Chatbots with Tensorflow.
The use of mobile phones has skyrocketed in the last decade, leading to a new area for junk promotions from disreputable marketers. People innocently give out their mobile phone numbers while using day-to-day services and are then flooded with spam promotional messages. In this post we will take a look at classifying SMS messages using the Naive Bayes machine learning model, understand why Naive Bayes works well for this use case, and also dive a little into word clouds to visualize the dataset.
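A minimal multinomial Naive Bayes classifier with Laplace smoothing illustrates why the model suits this task: it only needs per-class word counts, which are cheap to estimate even from short messages. The tiny SMS training set below is invented, not the dataset from the post:

```python
import math
from collections import Counter

def train_nb(messages):
    """Collect per-class word counts and class priors from labeled SMS texts."""
    counts = {"spam": Counter(), "ham": Counter()}
    priors = Counter()
    for text, label in messages:
        priors[label] += 1
        counts[label].update(text.lower().split())
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, priors, vocab

def classify_nb(model, text):
    """Pick the class with the highest (log) posterior, with Laplace smoothing."""
    counts, priors, vocab = model
    total = sum(priors.values())
    scores = {}
    for label in priors:
        logp = math.log(priors[label] / total)
        denom = sum(counts[label].values()) + len(vocab)  # +1 smoothing denominator
        for w in text.lower().split():
            logp += math.log((counts[label][w] + 1) / denom)
        scores[label] = logp
    return max(scores, key=scores.get)

model = train_nb([
    ("win a free prize now", "spam"),
    ("free entry call now", "spam"),
    ("are you coming home", "ham"),
    ("see you at lunch", "ham"),
])
```

The "naive" independence assumption (each word contributes independently to the class score) is what makes training and classification this simple.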
This story is a step-by-step guide to how I built a language detection model using machine learning (that ended up being 97% accurate) in under 20 minutes. Language detection is a great use case for machine learning, more specifically, for text classification. Given some text from an e-mail, news article, output of speech-to-text capabilities, or anywhere else, a language detection model will tell you what language it is in. This is a great way to quickly categorize and sort information, and apply additional layers of workflows that are language specific. For example, if you want to apply spell checking to a Word document, you first have to pick the correct language for the dictionary being used.
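One common, simple approach to language detection is comparing character n-gram frequencies against per-language profiles. The sketch below is a toy illustration of that idea (not the model from the story), with tiny invented sample texts standing in for real training corpora:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character trigrams, padded with spaces at the boundaries."""
    text = " " + text.lower() + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_profile(samples):
    """Frequency profile of character trigrams for one language."""
    return Counter(g for s in samples for g in char_ngrams(s))

# Tiny invented sample texts; a real model would use large corpora per language.
PROFILES = {
    "english": build_profile(["the quick brown fox", "this is a test", "hello there"]),
    "spanish": build_profile(["el zorro veloz", "esto es una prueba", "hola que tal"]),
}

def detect_language(text):
    """Score input trigrams against each profile and return the best match."""
    scores = {lang: sum(profile[g] for g in char_ngrams(text))
              for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)
```

Character n-grams work well here because languages differ strongly in their letter sequences, so even short inputs carry a clear signal.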
For beginners in data science and machine learning, a common problem is getting hold of a good, clean dataset for quick practice. Regression and classification are the two most common supervised machine learning tasks that a practitioner of data science has to deal with. It is not always possible to get a well-structured dataset for practicing the various algorithms one learns. Now, scikit-learn, the leading machine learning library in Python, does provide random dataset generation capabilities for regression and classification problems. However, the user has no easy control over the underlying mechanics of the data generation, and the regression output is not a definitive function of the inputs -- it is truly random.
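For full control over the underlying mechanics, one can hand-roll a generator in which the label is a definitive function of the inputs. A minimal sketch (the two-feature setup and the boundary 2*x1 + x2 > 0 are arbitrary choices for illustration):

```python
import random

def make_separable_dataset(n=100, seed=42):
    """Two features; the label is a deterministic function of the inputs."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    X, y = [], []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        X.append((x1, x2))
        y.append(1 if 2 * x1 + x2 > 0 else 0)  # known boundary: 2*x1 + x2 = 0
    return X, y
```

Because the decision boundary is known exactly, any classifier practiced on this data can be checked against ground truth rather than against another model's output.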
At this point, we have assembled our dataset and gained insights into the key characteristics of our data. Next, based on the metrics we gathered in Step 2, we should think about which classification model to use. This means asking questions such as: "How do we present the text data to an algorithm that expects numeric input?" (this is called data preprocessing and vectorization), "What type of model should we use?", and "What configuration parameters should we use for our model?" Thanks to decades of research, we have access to a large array of data preprocessing and model configuration options. However, the availability of so many viable options greatly increases the complexity and scope of the problem at hand.
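The vectorization question has a simplest possible answer: a bag-of-words encoding, which maps each document to a fixed-length vector of word counts. A minimal sketch:

```python
def build_vocabulary(docs):
    """Map each distinct word across the corpus to a fixed column index."""
    words = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(words)}

def vectorize(doc, vocab):
    """Encode text as a fixed-length vector of word counts (bag of words)."""
    vec = [0] * len(vocab)
    for w in doc.lower().split():
        if w in vocab:          # words outside the vocabulary are dropped
            vec[vocab[w]] += 1
    return vec
```

This loses word order, which is exactly the trade-off that motivates the richer preprocessing and model options the paragraph mentions.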