Collaborating Authors

Text Analysis 101; A Basic Understanding for Business Users: Document Classification


This blog was originally posted as part of our Text Analysis 101 blog series. It aims to explain how the classification of text works as part of Natural Language Processing. The automatic classification of documents is an example of how Machine Learning (ML) and Natural Language Processing (NLP) can be leveraged to enable machines to better understand human language. By classifying text, we are aiming to assign one or more classes or categories to a document or piece of text, making it easier to manage and sort the documents. Manually categorizing and grouping text sources can be extremely laborious and time-consuming, especially for publishers, news sites, blogs or anyone who deals with a lot of content.

Learning to Identify Known and Unknown Classes: A Case Study in Open World Malware Classification

AAAI Conferences

In this paper we propose an open world malware classification. Our approach is not only able to identify known families of malware but is also able to distinguish them from malware families that were never seen before. Our proposed approach is more accurate and scales better on two evaluation datasets when compared to existing algorithms.

Distribution Networks for Open Set Learning Machine Learning

In open set learning, a model must be able to generalize to novel classes when it encounters a sample that does not belong to any of the classes it has seen before. Open set learning poses a realistic learning scenario that is receiving growing attention. Existing studies on open set learning mainly focused on detecting novel classes, but few studies tried to model them for differentiating novel classes. We recognize that novel classes should be different from each other, and propose distribution networks for open set learning that can learn and model different novel classes. We hypothesize that, through a certain mapping, samples from different classes with the same classification criterion should follow different probability distributions from the same distribution family. We estimate the probability distribution for each known class and a novel class is detected when a sample is not likely to belong to any of the known distributions. Due to the large feature dimension in the original feature space, the probability distributions in the original feature space are difficult to estimate. Distribution networks map the samples in the original feature space to a latent space where the distributions of known classes can be jointly learned with the network. In the latent space, we also propose a distribution parameter transfer strategy for novel class detection and modeling. By novel class modeling, the detected novel classes can serve as known classes to the subsequent classification. Our experimental results on image datasets MNIST and CIFAR10 and text dataset Ohsumed show that the distribution networks can detect novel classes accurately and model them well for the subsequent classification tasks.

Teacher-Explorer-Student Learning: A Novel Learning Method for Open Set Recognition Artificial Intelligence

If an unknown example that is not seen during training appears, most recognition systems usually produce overgeneralized results and determine that the example belongs to one of the known classes. To address this problem, teacher-explorer-student (T/E/S) learning, which adopts the concept of open set recognition (OSR) that aims to reject unknown samples while minimizing the loss of classification performance on known samples, is proposed in this study. In this novel learning method, overgeneralization of deep learning classifiers is significantly reduced by exploring various possibilities of unknowns. Here, the teacher network extracts some hints about unknowns by distilling the pretrained knowledge about knowns and delivers this distilled knowledge to the student. After learning the distilled knowledge, the student network shares the learned information with the explorer network. Then, the explorer network shares its exploration results by generating unknown-like samples and feeding the samples to the student network. By repeating this alternating learning process, the student network experiences a variety of synthetic unknowns, reducing overgeneralization. Extensive experiments were conducted, and the experimental results showed that each component proposed in this paper significantly contributes to the improvement in OSR performance. As a result, the proposed T/E/S learning method outperformed current state-of-the-art methods.

[Question] How to deal with overlapping training data in classification. • /r/MachineLearning


I have some data that I need to classify into three groups (Q 0,1,2). The Q 0 training data is realatively well seperated from the other training data, but the Q 1, and Q 2 have a good amount of overlaps. See this figure for an example. I'm working with Scikit-Learn, and I've tried Random Forests, Extremely Randomized Forests, and SVM. In the testing step (before I apply it to unknown data) I get pretty good results (The recall and precision are both 60%).