"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.
In machine learning and statistics, classification is a supervised learning approach in which the computer program learns from the data input given to it and then uses this learning to classify new observation. This data set may simply be bi-class (like identifying whether the person is male or female or that the mail is spam or non-spam) or it may be multi-class too. Some examples of classification problems are: speech recognition, handwriting recognition, bio metric identification, document classification etc. It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
Online communication platforms are increasingly overwhelmed by rude or offensive comments, which can make people give up on discussion altogether. In response to this issue, the Jigsaw team and Google's Counter Abuse technology team collaborated with sites that have shared their comments and moderation decisions to create the Perspective API. Perspective helps online communities host better conversations by, for example, enabling moderators to more quickly identify which comments might violate their community guidelines. Several publishers have also worked on systems that provide feedback to users before they publish comments (e.g. the Coral Talk Plugin). Showing a pre-submit message that a comment might violate community guidelines, or that it will be held for review before publication, has already proven to be an effective tool to encourage users to think more carefully about how their comments might be received.
This report examines the Pinned AUC metric introduced in  and highlights some of its limitations. Pinned AUC provides a threshold-agnostic measure of unintended bias in a classification model, inspired by the ROC-AUC metric. However, as we highlight in this report, there are ways that the metric can obscure different kinds of unintended biases when the underlying class distributions on which bias is being measured are not carefully controlled. In , Pinned AUC is applied to a synthetically generated test set where all identity subgroups have identical representation of the classification labels. This method of controlling the class distributions avoids Pinned AUC's potential to obscure unintended biases. However, if the test data contains different distributions of classification labels between identities, Pinned AUC's measurement of bias can be skewed, either over or under representing the extent of unintended bias. In this report, the reasons for Pinned AUC's lack of robustness to variations in the class distributions are demonstrated. We also illustrate how unintended bias identified by Pinned AUC can be decomposed into the metrics presented in . To avoid requiring careful class balancing, which is hard to do on real data, instead of using Pinned AUC, the threshold agnostic metrics presented in  can be used; these are robust to variations in the class distributions and provide a more nuanced view of unintended bias.
The past year has ushered in an exciting age for Natural Language Processing using deep neural networks. Research in the field of using pre-trained models have resulted in massive leap in state-of-the-art results for many of the NLP tasks, such as text classification, natural language inference and question-answering. Some of the key milestones have been ELMo, ULMFiT and OpenAI Transformer. All these approaches allow us to pre-train an unsupervised language model on large corpus of data such as all wikipedia articles, and then fine-tune these pre-trained models on downstream tasks. Perhaps the most exciting event of the year in this area has been the release of BERT, a multilingual transformer based model that has achieved state-of-the-art results on various NLP tasks.
As a data scientist, one of your most important skills should be choosing the right modeling techniques and algorithms for your problems. A few months ago I was trying to solve a text classification problem of classifying which news articles will be relevant for my customers. I had only a few thousand labeled examples so I started with simple classic machine learning modeling methods like Logistic regression on TF-IDF. These models that usually works well on text classification of long documents like news articles performed only slightly better than random on this task. After investigating the mistakes of my model I found that bag of words representation is just not enough for this task and I need a model the will use a deeper semantic understanding of the documents.
Watson Natural Language Classifier (NLC) is a text classification (aka text categorization) service that enables developers to quickly train and integrate natural language processing (NLP) capabilities into their applications. Once you have the training data, you can set up a classification model (aka a classifier) in 15 minutes or less to label text with your custom labels. In this tutorial, I will show you how to create two classifiers using publicly available Airbnb reviews data. One of the more common text classification patterns I've seen is analyzing and labeling customer reviews. Understanding unstructured customer feedback enables organizations to make informed decisions that'll improve customer experience or resolve issues faster.
Detecting and aggregating sentiments toward people, organizations, and events expressed in unstructured social media have become critical text mining operations. Early systems detected sentiments over whole passages, whereas more recently, target-specific sentiments have been of greater interest. In this paper, we present MTTDSC, a multi-task target-dependent sentiment classification system that is informed by feature representation learnt for the related auxiliary task of passage-level sentiment classification. The auxiliary task uses a gated recurrent unit (GRU) and pools GRU states, followed by an auxiliary fully-connected layer that outputs passage-level predictions. In the main task, these GRUs contribute auxiliary per-token representations over and above word embeddings. The main task has its own, separate GRUs. The auxiliary and main GRUs send their states to a different fully connected layer, trained for the main task. Extensive experiments using two auxiliary datasets and three benchmark datasets (of which one is new, introduced by us) for the main task demonstrate that MTTDSC outperforms state-of-the-art baselines. Using word-level sensitivity analysis, we present anecdotal evidence that prior systems can make incorrect target-specific predictions because they miss sentiments expressed by words independent of target.
Accurate classification of seizure types plays a crucial role in the treatment and disease management of epileptic patients. Epileptic seizure type not only impacts on the choice of drugs but also on the range of activities a patient can safely engage in. With recent advances being made towards artificial intelligence enabled automatic seizure detection, the next frontier is the automatic classification of seizure types. On that note, in this paper, we undertake the first study to explore the application of machine learning algorithms for multi-class seizure type classification. We used the recently released TUH EEG Seizure Corpus and conducted a thorough search space exploration to evaluate the performance of a combination of various pre-processing techniques, machine learning algorithms, and corresponding hyperparameters on this task. We show that our algorithms can reach a weighted F1 score of up to 0.907 thereby setting the first benchmark for scalp EEG based multi-class seizure type classification.
"All words in a natural language are ambiguous; they have multiple senses," she said in an oral history interview for the History Center of the Institute of Electrical and Electronics Engineers. "How do you find out which sense they've got in any particular use?" In 1964, Sparck Jones published "Synonymy and Semantic Classification," which is now seen as a foundational paper in the field of natural language processing. In 1972, she introduced the concept of inverse document frequency, which counts the number of times a term is used in a document in order to determine the term's importance; it, too, is a foundation of modern search engines. Sparck Jones began working on early speech recognition systems in the 1980s.
Often times it happens that we fall short of creativity. And creativity is one of the basic ingredients of what we do. So here is the list of ideas I gather in day to day life, where people have used creativity to get great results on Kaggle leaderboards. This post is inspired by a Kernel on Kaggle written by Beluga, one of the top Kagglers, for a knowledge based competition. Some of the techniques/tricks I am sharing have been taken directly from that kernel so you could take a look yourself.