Text Classification

Ham or Spam? SMS Text Classification with Machine Learning


The use of mobile phones has skyrocketed in the last decade, opening a new channel for junk promotions from disreputable marketers. People innocently give out their mobile phone numbers while using day-to-day services and are then flooded with spam promotional messages. In this post we will classify SMS messages using the Naive Bayes machine learning model, look at why Naive Bayes works well for this use case, and dive a little into word clouds to visualize the dataset.
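As a rough illustration of the idea, here is a minimal multinomial Naive Bayes sketch in plain Python. It is not the code from the post: the four toy messages, the whitespace tokenizer, the add-one (Laplace) smoothing, and the function names `tokenize`, `train_naive_bayes`, and `predict` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would also
    # handle punctuation, numbers, and stop words.
    return text.lower().split()


def train_naive_bayes(messages, labels):
    """Collect per-class message counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(messages, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(text, class_counts, word_counts, vocabulary):
    """Return the class with the highest log posterior score."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        # Log prior: the fraction of training messages in this class.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing keeps unseen (class, word) pairs from
            # driving the probability to zero.
            numerator = word_counts[label][token] + 1
            denominator = total_words + len(vocabulary)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, for illustration only.
messages = [
    "win a free prize now",
    "free cash claim now",
    "are we meeting for lunch",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(messages, labels)

print(predict("claim your free prize", *model))
print(predict("see you for lunch", *model))
```

The "naive" part is the assumption that words occur independently given the class, which is why the score is just a sum of per-word log probabilities.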

How I trained a language detection AI in 20 minutes with a 97% accuracy


This story is a step-by-step guide to how I built a language detection model using machine learning (that ended up being 97% accurate) in under 20 minutes. Language detection is a great use case for machine learning and, more specifically, for text classification. Given some text from an e-mail, a news article, the output of a speech-to-text system, or anywhere else, a language detection model will tell you what language it is in. This is a great way to quickly categorize and sort information, and to apply additional layers of language-specific workflows. For example, if you want to apply spell checking to a Word document, you first have to pick the correct language for the dictionary being used.

Random regression and classification problem generation with symbolic expression


For beginners in data science and machine learning, a common problem is getting hold of a good, clean data set for quick practice. Regression and classification are the two most common supervised machine learning tasks that a data science practitioner has to deal with. It is not always possible to get a well-structured data set for practicing the various algorithms one learns. Scikit-Learn, the leading machine learning library in Python, does provide random data set generation for regression and classification problems. However, the user has no easy control over the underlying mechanics of the data generation, and the regression outputs are not a definitive function of the inputs; they are truly random.
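To make the contrast concrete, here is a small sketch of generating data whose target is a known function of the inputs. It is not the article's implementation: it uses only the standard library, a restricted `eval` stands in for a proper symbolic-math library, and the names `generate_regression`, `generate_classification`, and the variables `x1`, `x2`, ... are assumptions made for this example.

```python
import math
import random

# Math names the expression string is allowed to use (an assumption
# for this sketch; extend as needed).
SAFE_NAMES = {
    name: getattr(math, name)
    for name in ("sin", "cos", "exp", "log", "sqrt", "pi")
}


def generate_regression(expression, n_samples=100, n_features=2,
                        noise_sd=0.0, low=-1.0, high=1.0, seed=0):
    """Sample inputs uniformly and compute y from a known expression.

    The expression refers to the features as x1, x2, ... and may use
    the names in SAFE_NAMES. Gaussian noise with standard deviation
    noise_sd is added to each target.
    """
    rng = random.Random(seed)
    code = compile(expression, "<expression>", "eval")
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.uniform(low, high) for _ in range(n_features)]
        variables = {f"x{i + 1}": value for i, value in enumerate(row)}
        # Builtins are disabled so only the listed names are visible.
        target = eval(code, {"__builtins__": {}},
                      {**SAFE_NAMES, **variables})
        X.append(row)
        y.append(target + rng.gauss(0.0, noise_sd))
    return X, y


def generate_classification(expression, threshold=0.0, **kwargs):
    """Turn the regression output into binary labels by thresholding."""
    X, y = generate_regression(expression, **kwargs)
    return X, [1 if value > threshold else 0 for value in y]


X, y = generate_regression("x1**2 + sin(x2)", n_samples=5)
X_cls, labels = generate_classification("x1 - x2", n_samples=5)
```

Because the generating expression is known, a fitted model can be checked against the true function instead of only against a held-out error score.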

Step 2.5: Choose a Model | ML Universal Guides | Google Developers


At this point, we have assembled our dataset and gained insights into the key characteristics of our data. Next, based on the metrics we gathered in Step 2, we should think about which classification model to use. This means asking questions such as, "How do we present the text data to an algorithm that expects numeric input?" (this is called data preprocessing and vectorization), "What type of model should we use?", and "What configuration parameters should we use for our model?" Thanks to decades of research, we have access to a large array of data preprocessing and model configuration options. However, having so many viable options to choose from greatly increases the complexity and scope of the problem at hand.
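One simple answer to the first question is a bag-of-words count vector. The sketch below is only an illustration of that idea, not the guide's recommended pipeline; the whitespace tokenizer, the two sample sentences, and the helper names `build_vocabulary` and `vectorize` are assumptions made for this example.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map every distinct token to a fixed column index."""
    tokens = sorted({token for text in texts for token in text.lower().split()})
    return {token: index for index, token in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Turn a text into a vector of per-token counts."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:  # tokens outside the vocabulary are dropped
            vector[vocabulary[token]] = count
    return vector


# Hypothetical sample texts, for illustration only.
texts = ["the movie was great", "the movie was not great at all"]
vocabulary = build_vocabulary(texts)
vectors = [vectorize(text, vocabulary) for text in texts]

print(sorted(vocabulary, key=vocabulary.get))
for vector in vectors:
    print(vector)
```

Each text becomes a fixed-length list of numbers, which is the form most classifiers expect. Word order is lost, which is one reason the choice of vectorization interacts with the choice of model.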

Projects In Machine Learning NLP for Text Classification with NLTK & Scikit-learn Eduonix


In this tutorial, we will cover Natural Language Processing for Text Classification with NLTK & Scikit-learn. Remember the last Natural Language Processing project we did? We will be using all that information to create a spam filter. This tutorial will also cover feature engineering and ensemble NLP in text classification. This project will use Jupyter Notebook running Python 2.7.

Log Skeletons: A Classification Approach to Process Discovery

arXiv.org Artificial Intelligence

To test the effectiveness of process discovery algorithms, a Process Discovery Contest (PDC) has been set up. This PDC uses a classification approach to measure this effectiveness: The better the discovered model can classify whether or not a new trace conforms to the event log, the better the discovery algorithm is supposed to be. Unfortunately, even the state-of-the-art fully-automated discovery algorithms score poorly on this classification. Even the best of these algorithms, the Inductive Miner, scored only 147 correctly classified traces out of 200 traces on the PDC of 2017. This paper introduces the rule-based log skeleton model, which is closely related to the Declare constraint model, together with a way to classify traces using this model. This classification using log skeletons is shown to score better on the PDC of 2017 than state-of-the-art discovery algorithms: 194 out of 200. As a result, one can argue that the fully-automated algorithm to construct (or: discover) a log skeleton from an event log outperforms existing state-of-the-art fully-automated discovery algorithms.

Text Classification based on Word Subspace with Term-Frequency

arXiv.org Machine Learning

Text classification has become indispensable due to the rapid increase of text in digital form. Over the past three decades, efforts have been made to approach this task using various learning algorithms and statistical models based on bag-of-words (BOW) features. Despite their simple implementation, BOW features lack semantic meaning representation. To solve this problem, neural networks started to be employed to learn word vectors, such as word2vec. Word2vec embeds word semantic structure into vectors, where the angle between vectors indicates the meaningful similarity between words. To measure the similarity between texts, we propose the novel concept of word subspace, which can represent the intrinsic variability of features in a set of word vectors. Through this concept, it is possible to model text from word vectors while holding semantic information. To incorporate the word frequency directly in the subspace model, we further extend the word subspace to the term-frequency (TF) weighted word subspace. Based on these new concepts, text classification can be performed under the mutual subspace method (MSM) framework. The validity of our modeling is shown through experiments on the Reuters text database, comparing the results to various state-of-the-art algorithms.
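The "angle between vectors" notion of word similarity is usually measured with cosine similarity. The sketch below shows only that building block, not the word subspace or MSM method from the paper; the three-dimensional vectors are made-up toy values rather than real word2vec embeddings, and the name `cosine_similarity` is chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up toy vectors standing in for learned word embeddings.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))
print(cosine_similarity(toy_vectors["king"], toy_vectors["banana"]))
```

A value near 1 means the vectors point in almost the same direction, and a value near 0 means they are close to orthogonal.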

r/MachineLearning - [D] Text classification on a small dataset


I am trying to perform multiclass text classification (for 24 classes) on a set of documents, but I have a very small dataset currently (1200 total examples). The data collection process is a bit tedious in my case, hence the small dataset size. The best result I have achieved so far is 58% accuracy with an SVM model and a single-layer CNN model. Is there any other approach I can try other than collecting more data? I have tried oversampling the training set, but it didn't seem to improve the performance.

Text Classification with TensorFlow Estimators


Note: This post was written together with the awesome Julian Eisenschlos and was originally published on the TensorFlow blog. Throughout this post we will show you how to classify text using Estimators in TensorFlow. Welcome to Part 4 of a blog series that introduces TensorFlow Datasets and Estimators. You don't need to read all of the previous material, but take a look if you want to refresh any of the following concepts. Part 1 focused on pre-made Estimators, Part 2 discussed feature columns, and Part 3 how to create custom Estimators.

Machine Learning Helps Humans Perform Text Analysis


To augment that approach, we've found that we can use machine learning to improve the semantic data models as the data set evolves. Our specific use case is text data in millions of documents. We've found that machine learning facilitates the storage and exploration of data that would otherwise be too vast to support valuable insights. Machine Learning (ML) allows a model to improve over time given new training data, without requiring more human effort. For example, a common text-classification benchmark task is to train a model on messages from multiple discussion board threads and then later use it to predict what the topic of discussion was (space, computers, religion, etc.).