Text Classification
[R] Text Classification Using Label Names Only: A Language Model Self-Training Approach
Abstract: Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans can perform classification without seeing any labeled examples but only based on a small set of words describing the categories to be classified. In this paper, we explore the potential of only using the label name of each class to train classification models on unlabeled data, without using any labeled documents. We use pre-trained neural language models both as general linguistic knowledge sources for category understanding and as representation learning models for document classification. Our method (1) associates semantically related words with the label names, (2) finds category-indicative words and trains the model to predict their implied categories, and (3) generalizes the model via self-training.
TF-IDF Refresher
Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Simply put, TF-IDF shows the relative importance of a word or words to a document, given a collection of documents. Note that before we can do text classification, the text must be translated into some form of numerical representation, a process known as text embedding. The resulting numerical representation, which is usually in the form of vectors, can then be used as input to a wide range of classification models. TF-IDF is one of the most popular approaches for embedding texts into numerical vectors for modeling, information retrieval and text mining.
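As a rough illustration of the statistic described above, here is a minimal sketch in plain Python. It assumes whitespace tokenization, term frequency as count divided by document length, and the unsmoothed inverse document frequency log(N / df); libraries commonly use smoothed or normalized variants, so their numbers will differ. The function name `tf_idf` and the sample documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # tf = share of the document taken by the term; idf = log(N / df).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
]
vectors = tf_idf(docs)
```

In this toy corpus, "mat" appears in only one document and so receives a higher weight in the first document than "the", even though "the" occurs there twice. A term that appears in every document gets a weight of zero under this unsmoothed formula.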
Deploying a Text Classification Model in Python
This article is the last of a series in which I cover the whole process of developing a machine learning project. If you have not read the previous two articles, I strongly encourage you to read them first. The project involves the creation of a real-time web application that gathers data from several newspapers and shows a summary of the different topics that are being discussed in the news articles. This is achieved with a supervised machine learning classification model that is able to predict the category of a given news article, a web scraping method that gets the latest news from the newspapers, and an interactive web application that shows the obtained results to the user. As I explained in the first post of this series, the reason I'm writing these articles is that I've noticed that, most of the time, the content published on the internet, in books or in the literature regarding data science focuses on the following: we have a labeled dataset and we train models to obtain a performance metric.
Immigration Document Classification and Automated Response Generation
Mukherjee, Sourav, Oates, Tim, DiMascio, Vince, Jean, Huguens, Ares, Rob, Widmark, David, Harder, Jaclyn
In this paper, we consider the problem of organizing supporting documents vital to U.S. work visa petitions, as well as responding to Requests For Evidence (RFE) issued by the U.S. Citizenship and Immigration Services (USCIS). Typically, both processes require a significant amount of repetitive manual effort. To reduce the burden of mechanical work, we apply machine learning methods to automate these processes, with humans in the loop to review and edit output for submission. In particular, we use an ensemble of image and text classifiers to categorize supporting documents. We also use a text classifier to automatically identify the types of evidence being requested in an RFE, and use the identified types in conjunction with response templates and extracted fields to assemble draft responses. Empirical results suggest that our approach achieves considerable accuracy while significantly reducing processing time.
Text Classification with Novelty Detection
Qin, Qi, Hu, Wenpeng, Liu, Bing
This paper studies the problem of detecting novel or unexpected instances in text classification. In traditional text classification, the classes that appear in testing must have been seen in training. However, in many applications, this is not the case because in testing, we may see unexpected instances that are not from any of the training classes. In this paper, we propose a significantly more effective approach that converts the original problem to a pair-wise matching problem and then outputs how probable it is that two instances belong to the same class. Under this approach, we present two models. The more effective model uses two embedding matrices of a pair of instances as two channels of a CNN. The output probabilities from such pairs are used to judge whether a test instance is from a seen class or is novel/unexpected. Experimental results show that the proposed method substantially outperforms the state-of-the-art baselines.
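The decision step described in the abstract (score a test instance against instances of seen classes, then decide whether it is seen or novel) might be sketched as follows. This is not the paper's model: the two-channel CNN is replaced here by a simple Jaccard token-overlap score standing in for the learned match probability, and taking the maximum score per class with a fixed `threshold` is an assumption made for illustration. All names (`match_probability`, `classify_or_reject`, `seen`) are hypothetical.

```python
def match_probability(a, b):
    """Stand-in for a learned pairwise model: Jaccard token overlap in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def classify_or_reject(test_text, seen_examples, threshold=0.2):
    """Return (label, score); the label is "novel" if no seen class matches well."""
    best_label, best_score = None, 0.0
    for label, examples in seen_examples.items():
        # Assumption: a class is scored by its best-matching example.
        score = max(
            (match_probability(test_text, ex) for ex in examples),
            default=0.0,
        )
        if score > best_score:
            best_label, best_score = label, score
    if best_label is None or best_score < threshold:
        return "novel", best_score
    return best_label, best_score


# Hypothetical examples of seen classes.
seen = {
    "sports": ["the team won the match", "the striker scored a goal"],
    "finance": ["the bank raised interest rates", "stock prices fell today"],
}
known = classify_or_reject("the team scored a late goal", seen)
unknown = classify_or_reject("quantum computers factor large integers", seen)
```

Here the first query is assigned to "sports", while the second shares no tokens with any seen example and is flagged as novel. With a trained pairwise model in place of the overlap score, the same thresholding logic would apply to its output probabilities.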
Data Programming by Demonstration: A Framework for Interactively Learning Labeling Functions
Evensen, Sara, Ge, Chang, Choi, Dongjin, Demiralp, Çağatay
Data programming is a programmatic weak supervision approach to efficiently curate large-scale labeled training data. Writing data programs (labeling functions) requires, however, both programming literacy and domain expertise. Many subject matter experts have neither programming proficiency nor time to effectively write data programs. Furthermore, regardless of one's expertise in coding or machine learning, transferring domain expertise into labeling functions by enumerating rules and thresholds is not only time consuming but also inherently difficult. Here we propose a new framework, data programming by demonstration (DPBD), to generate labeling rules using interactive demonstrations of users. DPBD aims to relieve the burden of writing labeling functions from users, enabling them to focus on higher-level semantics such as identifying relevant signals for labeling tasks. We operationalize our framework with Ruler, an interactive system that synthesizes labeling rules for document classification by using span-level annotations of users on document examples. We compare Ruler with conventional data programming through a user study conducted with 10 data scientists creating labeling functions for sentiment and spam classification tasks. We find that Ruler is easier to use and learn and offers higher overall satisfaction, while providing discriminative model performances comparable to ones achieved by conventional data programming.
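To make the idea of labeling functions concrete, here is a minimal hand-written sketch for a spam task. It shows the conventional style of data programming that the abstract contrasts with Ruler, not Ruler's synthesized rules. The rules, the label constants, and the majority-vote combiner are all assumptions for illustration; data programming systems typically combine votes with a learned label model rather than a simple majority.

```python
from collections import Counter

# Hypothetical label constants; ABSTAIN means "this rule has no opinion".
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN


def lf_greeting(text):
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_greeting]


def weak_label(text):
    """Combine labeling-function votes by majority; abstain on ties or no votes."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # Conflicting rules with equal support.
    return ranked[0][0]


# Hypothetical unlabeled messages.
messages = [
    "Win a FREE prize now!!!",
    "Hello, are we still meeting tomorrow?",
    "Quarterly report attached",
]
labels = [weak_label(message) for message in messages]
```

The resulting weak labels can then be used to train a discriminative classifier on the messages where at least one rule fired. Writing rules like these by hand is the step that requires both programming skill and domain expertise.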
Text Classification with NO model training
NLP (Natural Language Processing) is the field of artificial intelligence that studies the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP is often applied for classifying text data. Text classification is the problem of assigning categories to text data according to its content. In order to carry out a classification use case, you need a labeled dataset to train machine learning models. So what happens if you don't have one?
Improving Indonesian Text Classification Using Multilingual Language Model
Putra, Ilham Firdausi, Purwarianti, Ayu
Compared to English, the amount of labeled data for Indonesian text classification tasks is very small. Recently developed multilingual language models have shown their ability to create multilingual representations effectively. This paper investigates the effect of combining English and Indonesian data on building Indonesian text classification models (e.g., sentiment analysis and hate speech detection) using multilingual language models. Using the feature-based approach, we observe its performance on various data sizes and amounts of added English data. The experiment showed that the addition of English data, especially if the amount of Indonesian data is small, improves performance. Using the fine-tuning approach, we further showed its effectiveness in utilizing the English language to build Indonesian text classification models.
How Document Classification Can Improve Business Processes
The process of labeling documents into categories based on the type of content is known as document classification. It can also be defined as the process of assigning one or more classes or categories to a document (depending on the type of content) to make it easy to sort and manage images, texts, and videos. Document classification can be done using artificial intelligence and machine learning, for example in Python. This classification can be done in two ways: manually or automatically. The former gives humans full authority over the classification.