AITopics | Text Classification

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

[R] Text Classification Using Label Names Only: A Language Model Self-Training Approach

#artificialintelligence, Oct-16-2020, 02:05:56 GMT

Abstract: Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans can perform classification without seeing any labeled examples but only based on a small set of words describing the categories to be classified. In this paper, we explore the potential of only using the label name of each class to train classification models on unlabeled data, without using any labeled documents. We use pre-trained neural language models both as general linguistic knowledge sources for category understanding and as representation learning models for document classification. Our method (1) associates semantically related words with the label names, (2) finds category-indicative words and trains the model to predict their implied categories, and (3) generalizes the model via self-training.

machine learning, natural language, text classification, (5 more...)

#artificialintelligence

Industry: Media > News (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

TF-IDF Refresher

#artificialintelligence, Oct-13-2020, 01:31:38 GMT

Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Simply put, TF-IDF shows the relative importance of a word or words to a document, given a collection of documents. Note that before we can do text classification, the text must be translated into some form of numerical representation, a process known as text embedding. The resulting numerical representation, which is usually in the form of vectors, can then be used as input to a wide range of classification models. TF-IDF is one of the most popular approaches for embedding texts into numerical vectors for modeling, information retrieval and text mining.
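
As a rough illustration of the weighting described above, here is a minimal sketch in plain Python. It assumes the textbook definitions (term frequency as count divided by document length, inverse document frequency as ln(N / df)) and whitespace tokenization; the function name `tf_idf` and the sample documents are made up for the example, and libraries such as scikit-learn use smoothed variants that give different numbers.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumption: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Illustrative corpus (not from the article).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on the news",
]
weights = tf_idf(docs)
```

A word that occurs in every document, such as "the" here, gets weight zero, while a word unique to one document, such as "mat", gets the highest weight in that document.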

artificial intelligence, natural language, text classification, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.61)

Deploying a Text Classification Model in Python

#artificialintelligence, Oct-1-2020, 15:11:17 GMT

This article is the last of a series in which I cover the whole process of developing a machine learning project. If you have not read the previous two articles, I strongly encourage you to read them first. The project involves the creation of a real-time web application that gathers data from several newspapers and shows a summary of the different topics that are being discussed in the news articles. This is achieved with a supervised machine learning classification model that is able to predict the category of a given news article, a web scraping method that gets the latest news from the newspapers, and an interactive web application that shows the obtained results to the user. As I explained in the first post of this series, I'm writing these articles because I've noticed that, most of the time, the content published on the internet, in books or in the literature regarding data science focuses on the following: we have a labeled dataset and we train models to obtain a performance metric.

machine learning, natural language, text classification, (14 more...)

#artificialintelligence

Country:

North America (0.05)
Europe (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.52)

Immigration Document Classification and Automated Response Generation

Mukherjee, Sourav, Oates, Tim, DiMascio, Vince, Jean, Huguens, Ares, Rob, Widmark, David, Harder, Jaclyn

arXiv.org Machine Learning, Sep-29-2020

In this paper, we consider the problem of organizing supporting documents vital to U.S. work visa petitions, as well as responding to Requests For Evidence (RFE) issued by the U.S. Citizenship and Immigration Services (USCIS). Typically, both processes require a significant amount of repetitive manual effort. To reduce the burden of mechanical work, we apply machine learning methods to automate these processes, with humans in the loop to review and edit output for submission. In particular, we use an ensemble of image and text classifiers to categorize supporting documents. We also use a text classifier to automatically identify the types of evidence being requested in an RFE, and use the identified types in conjunction with response templates and extracted fields to assemble draft responses. Empirical results suggest that our approach achieves considerable accuracy while significantly reducing processing time.

classifier, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2010.01997

Country:

South America > Colombia > Meta Department > Villavicencio (0.04)
South America > Brazil (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(13 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Government > Immigration & Customs (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.94)

Text Classification with Novelty Detection

Qin, Qi, Hu, Wenpeng, Liu, Bing

arXiv.org Machine Learning, Sep-23-2020

This paper studies the problem of detecting novel or unexpected instances in text classification. In traditional text classification, the classes that appear in testing must have been seen in training. However, in many applications, this is not the case because in testing, we may see unexpected instances that are not from any of the training classes. In this paper, we propose a significantly more effective approach that converts the original problem to a pair-wise matching problem and then outputs how probable it is that two instances belong to the same class. Under this approach, we present two models. The more effective model uses two embedding matrices of a pair of instances as two channels of a CNN. The output probabilities from such pairs are used to judge whether a test instance is from a seen class or is novel/unexpected. Experimental results show that the proposed method substantially outperforms the state-of-the-art baselines.
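
The pairwise idea in this abstract can be sketched roughly as follows. This is not the paper's model: the learned CNN is replaced by a hypothetical stand-in, `same_class_probability`, that scores Jaccard token overlap. The decision rule (take the best-scoring labeled example and reject below a threshold) and the threshold value are also assumptions made for illustration.

```python
def same_class_probability(a, b):
    """Hypothetical stand-in for the learned pairwise model: Jaccard token overlap."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def classify_or_reject(text, labeled_examples, threshold=0.2):
    """Return the best-matching seen class, or "novel" if no pair scores high enough."""
    best_label, best_score = "novel", 0.0
    for example, label in labeled_examples:
        score = same_class_probability(text, example)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else "novel"

# Illustrative training set with two seen classes.
train = [
    ("cheap pills buy now", "spam"),
    ("win a free prize now", "spam"),
    ("meeting moved to friday", "work"),
    ("please review the quarterly report", "work"),
]
seen = classify_or_reject("buy cheap pills today", train)
unseen = classify_or_reject("recipe for lemon cake", train)
```

The first query overlaps strongly with a spam example and is assigned to that class. The second matches no training example and is flagged as novel.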

data mining, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2009.11119

Country: North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Data Programming by Demonstration: A Framework for Interactively Learning Labeling Functions

Evensen, Sara, Ge, Chang, Choi, Dongjin, Demiralp, Çağatay

arXiv.org Machine Learning, Sep-15-2020

Data programming is a programmatic weak supervision approach to efficiently curate large-scale labeled training data. Writing data programs (labeling functions) requires, however, both programming literacy and domain expertise. Many subject matter experts have neither programming proficiency nor time to effectively write data programs. Furthermore, regardless of one's expertise in coding or machine learning, transferring domain expertise into labeling functions by enumerating rules and thresholds is not only time consuming but also inherently difficult. Here we propose a new framework, data programming by demonstration (DPBD), to generate labeling rules using interactive demonstrations of users. DPBD aims to relieve the burden of writing labeling functions from users, enabling them to focus on higher-level semantics such as identifying relevant signals for labeling tasks. We operationalize our framework with Ruler, an interactive system that synthesizes labeling rules for document classification by using span-level annotations of users on document examples. We compare Ruler with conventional data programming through a user study conducted with 10 data scientists creating labeling functions for sentiment and spam classification tasks. We find that Ruler is easier to use and learn and offers higher overall satisfaction, while providing discriminative model performances comparable to ones achieved by conventional data programming.
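
For readers unfamiliar with labeling functions, the following minimal sketch shows what conventional, hand-written data programming might look like for the spam task mentioned above. The three rules and their names are hypothetical, and votes are combined by simple majority as a simplification; data programming systems typically fit a label model over the votes instead. Ruler, as described in the abstract, synthesizes such rules from span-level annotations rather than requiring users to write them.

```python
ABSTAIN, HAM, SPAM = -1, 0, 1

# Hypothetical labeling functions: each encodes one rule and may abstain.
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 2 else ABSTAIN

def lf_mentions_meeting(text):
    return HAM if "meeting" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_mentions_meeting]

def weak_label(text):
    """Combine labeling-function votes by majority; abstain on ties or no votes."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    spam_votes = votes.count(SPAM)
    ham_votes = votes.count(HAM)
    if spam_votes == ham_votes:
        return ABSTAIN
    return SPAM if spam_votes > ham_votes else HAM

# Illustrative unlabeled messages.
labels = [weak_label(t) for t in [
    "FREE prize!!! Click now",
    "Agenda for the meeting tomorrow",
    "See you later",
]]
```

The resulting weak labels (with abstentions dropped) would then serve as training data for a discriminative classifier.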

demonstration, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2009.01444

Genre:

Questionnaire & Opinion Survey (0.88)
Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.66)

Text Classification with NO model training

#artificialintelligence, Sep-13-2020, 14:10:10 GMT

NLP (Natural Language Processing) is the field of artificial intelligence that studies the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP is often applied to classifying text data. Text classification is the problem of assigning categories to text data according to its content. To carry out a classification use case, you need a labeled dataset to train machine learning models. So what happens if you don't have one?

machine learning, model training, text classification, (4 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.64)

Improving Indonesian Text Classification Using Multilingual Language Model

Putra, Ilham Firdausi, Purwarianti, Ayu

arXiv.org Artificial Intelligence, Sep-11-2020

Compared to English, the amount of labeled data for Indonesian text classification tasks is very small. Recently developed multilingual language models have shown their ability to create multilingual representations effectively. This paper investigates the effect of combining English and Indonesian data on building Indonesian text classification models (e.g., sentiment analysis and hate speech) using multilingual language models. Using the feature-based approach, we observe its performance on various data sizes and total added English data. The experiment showed that the addition of English data, especially if the amount of Indonesian data is small, improves performance. Using the fine-tuning approach, we further showed its effectiveness in utilizing the English language to build Indonesian text classification models.

machine learning, natural language, text classification, (17 more...)

arXiv.org Artificial Intelligence

2009.05713

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Tuscany > Florence (0.05)
Oceania > Australia > Victoria > Melbourne (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

How Document Classification Can Improve Business Processes

#artificialintelligence, Sep-7-2020, 14:55:27 GMT

The process of labeling documents into categories based on the type of content is known as document classification. It can also be defined as the process of assigning one or more classes or categories to a document (depending on the type of content) to make it easy to sort and manage images, texts, and videos. Document classification can be done using artificial intelligence, machine learning, and Python. This classification can be done in two ways: manually or automatically. The former gives humans full authority over the classification.

classification, natural language, text classification, (13 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)

Tutorial On Keras Tokenizer For Text Classification in NLP

#artificialintelligence, Aug-31-2020, 16:45:25 GMT

Now we will compile the model with stochastic gradient descent as the optimizer, cross-entropy as the loss, and accuracy as the performance metric. After compiling, we will train the model and check its performance on the validation data. We use a batch size of 64 and train for 10 epochs.
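
To make these training settings concrete, here is a plain-Python stand-in rather than the tutorial's Keras code. It trains a logistic regression with mini-batch stochastic gradient descent on binary cross-entropy, using a batch size of 64 and 10 epochs, and reports accuracy on held-out validation data. The model, the learning rate, the synthetic two-cluster data and every function name are assumptions made for illustration only.

```python
import math
import random

def train_sgd(X, y, epochs=10, batch_size=64, lr=0.5, seed=0):
    """Logistic regression trained with mini-batch SGD on binary cross-entropy."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = [0.0] * len(w)
            grad_b = 0.0
            for i in batch:
                z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
                p = 1.0 / (1.0 + math.exp(-z))
                # Gradient of cross-entropy with respect to the logit is (p - y).
                err = p - y[i]
                for j, xj in enumerate(X[i]):
                    grad_w[j] += err * xj
                grad_b += err
            # One SGD step using the batch-averaged gradient.
            w = [wj - lr * g / len(batch) for wj, g in zip(w, grad_w)]
            b -= lr * grad_b / len(batch)
    return w, b

def accuracy(w, b, X, y):
    """Fraction of examples whose predicted class matches the label."""
    correct = sum(
        (sum(wj * xj for wj, xj in zip(w, x)) + b > 0) == (label == 1)
        for x, label in zip(X, y)
    )
    return correct / len(X)

# Synthetic data (assumption): two Gaussian clusters, one per class.
rng = random.Random(1)
data = []
for _ in range(800):
    label = rng.randint(0, 1)
    center = 2.0 if label == 1 else -2.0
    data.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))

train, val = data[:640], data[640:]
w, b = train_sgd([x for x, _ in train], [t for _, t in train])
val_accuracy = accuracy(w, b, [x for x, _ in val], [t for _, t in val])
```

With 640 training examples and a batch size of 64, each epoch performs 10 gradient updates, so the 10 epochs amount to 100 updates in total.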

machine learning, natural language, test, (15 more...)

#artificialintelligence

Genre: Instructional Material > Course Syllabus & Notes (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.41)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.34)
