"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.
Online communication platforms are increasingly overwhelmed by rude or offensive comments, which can make people give up on discussion altogether. In response to this issue, the Jigsaw team and Google's Counter Abuse technology team collaborated with sites that have shared their comments and moderation decisions to create the Perspective API. Perspective helps online communities host better conversations by, for example, enabling moderators to more quickly identify which comments might violate their community guidelines. Several publishers have also worked on systems that provide feedback to users before they publish comments (e.g. the Coral Talk Plugin). Showing a pre-submit message that a comment might violate community guidelines, or that it will be held for review before publication, has already proven to be an effective tool to encourage users to think more carefully about how their comments might be received.
The past year has ushered in an exciting age for Natural Language Processing using deep neural networks. Research in the field of using pre-trained models have resulted in massive leap in state-of-the-art results for many of the NLP tasks, such as text classification, natural language inference and question-answering. Some of the key milestones have been ELMo, ULMFiT and OpenAI Transformer. All these approaches allow us to pre-train an unsupervised language model on large corpus of data such as all wikipedia articles, and then fine-tune these pre-trained models on downstream tasks. Perhaps the most exciting event of the year in this area has been the release of BERT, a multilingual transformer based model that has achieved state-of-the-art results on various NLP tasks.
As a data scientist, one of your most important skills should be choosing the right modeling techniques and algorithms for your problems. A few months ago I was trying to solve a text classification problem of classifying which news articles will be relevant for my customers. I had only a few thousand labeled examples so I started with simple classic machine learning modeling methods like Logistic regression on TF-IDF. These models that usually works well on text classification of long documents like news articles performed only slightly better than random on this task. After investigating the mistakes of my model I found that bag of words representation is just not enough for this task and I need a model the will use a deeper semantic understanding of the documents.
Watson Natural Language Classifier (NLC) is a text classification (aka text categorization) service that enables developers to quickly train and integrate natural language processing (NLP) capabilities into their applications. Once you have the training data, you can set up a classification model (aka a classifier) in 15 minutes or less to label text with your custom labels. In this tutorial, I will show you how to create two classifiers using publicly available Airbnb reviews data. One of the more common text classification patterns I've seen is analyzing and labeling customer reviews. Understanding unstructured customer feedback enables organizations to make informed decisions that'll improve customer experience or resolve issues faster.
Detecting and aggregating sentiments toward people, organizations, and events expressed in unstructured social media have become critical text mining operations. Early systems detected sentiments over whole passages, whereas more recently, target-specific sentiments have been of greater interest. In this paper, we present MTTDSC, a multi-task target-dependent sentiment classification system that is informed by feature representation learnt for the related auxiliary task of passage-level sentiment classification. The auxiliary task uses a gated recurrent unit (GRU) and pools GRU states, followed by an auxiliary fully-connected layer that outputs passage-level predictions. In the main task, these GRUs contribute auxiliary per-token representations over and above word embeddings. The main task has its own, separate GRUs. The auxiliary and main GRUs send their states to a different fully connected layer, trained for the main task. Extensive experiments using two auxiliary datasets and three benchmark datasets (of which one is new, introduced by us) for the main task demonstrate that MTTDSC outperforms state-of-the-art baselines. Using word-level sensitivity analysis, we present anecdotal evidence that prior systems can make incorrect target-specific predictions because they miss sentiments expressed by words independent of target.
Accurate classification of seizure types plays a crucial role in the treatment and disease management of epileptic patients. Epileptic seizure type not only impacts on the choice of drugs but also on the range of activities a patient can safely engage in. With recent advances being made towards artificial intelligence enabled automatic seizure detection, the next frontier is the automatic classification of seizure types. On that note, in this paper, we undertake the first study to explore the application of machine learning algorithms for multi-class seizure type classification. We used the recently released TUH EEG Seizure Corpus and conducted a thorough search space exploration to evaluate the performance of a combination of various pre-processing techniques, machine learning algorithms, and corresponding hyperparameters on this task. We show that our algorithms can reach a weighted F1 score of up to 0.907 thereby setting the first benchmark for scalp EEG based multi-class seizure type classification.
"All words in a natural language are ambiguous; they have multiple senses," she said in an oral history interview for the History Center of the Institute of Electrical and Electronics Engineers. "How do you find out which sense they've got in any particular use?" In 1964, Sparck Jones published "Synonymy and Semantic Classification," which is now seen as a foundational paper in the field of natural language processing. In 1972, she introduced the concept of inverse document frequency, which counts the number of times a term is used in a document in order to determine the term's importance; it, too, is a foundation of modern search engines. Sparck Jones began working on early speech recognition systems in the 1980s.
Often times it happens that we fall short of creativity. And creativity is one of the basic ingredients of what we do. So here is the list of ideas I gather in day to day life, where people have used creativity to get great results on Kaggle leaderboards. This post is inspired by a Kernel on Kaggle written by Beluga, one of the top Kagglers, for a knowledge based competition. Some of the techniques/tricks I am sharing have been taken directly from that kernel so you could take a look yourself.
Classification problems having multiple classes with imbalanced dataset present a different challenge than a binary classification problem. The skewed distribution makes many conventional machine learning algorithms less effective, especially in predicting minority class examples. In order to do so, let us first understand the problem at hand and then discuss the ways to overcome those. The data set we will be using for this example is the famous "20 News groups" data set. The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
Hierarchical text classification, which aims to classify text documents into a given hierarchy, is an important task in many real-world applications. Recently, deep neural models are gaining increasing popularity for text classification due to their expressive power and minimum requirement for feature engineering. However, applying deep neural networks for hierarchical text classification remains challenging, because they heavily rely on a large amount of training data and meanwhile cannot easily determine appropriate levels of documents in the hierarchical setting. In this paper, we propose a weakly-supervised neural method for hierarchical text classification. Our method does not require a large amount of training data but requires only easy-to-provide weak supervision signals such as a few class-related documents or keywords. Our method effectively leverages such weak supervision signals to generate pseudo documents for model pre-training, and then performs self-training on real unlabeled data to iteratively refine the model. During the training process, our model features a hierarchical neural structure, which mimics the given hierarchy and is capable of determining the proper levels for documents with a blocking mechanism. Experiments on three datasets from different domains demonstrate the efficacy of our method compared with a comprehensive set of baselines.