I am currently trying to work out a way to accurately classify documents into 3 different categories. The documents are rather lengthy, usually several thousands of words, unstructured and pretty much entirely full sentences. There are some keywords that increases the probability of the document belonging to one particular category, but not all of them are known. Until now I have tried to clean the documents by getting rid of punctuation, common stop words and non-alphabetical strings. Since only a small part of the text is relevant, I was planning to try a tf-idf process to identify significant words within the documents.
K-means is a clustering algorithm and not classification method. On the other hand, SVM is a classification method. We do clustering when we don't have class labels and perform classification when we have class labels. Clustering is a unsupervised learning technique and classification is a supervised learning technique. Therefore, comparing both of them are comparing apple and oranges.
The number of possible methods of generalizing binary classification to multi-class classification increases exponentially with the number of class labels. Often, the best method of doing so will be highly problem dependent. Here we present classification software in which the partitioning of multi-class classification problems into binary classification problems is specified using a recursive control language.
This matrix can be used for 2-class problems where it is very easy to understand, but can easily be applied to problems with 3 or more class values, by adding more rows and columns to the confusion matrix. Let's make this explanation of creating a confusion matrix concrete with an example. Let's pretend we have a two-class classification problem of predicting whether a photograph contains a man or a woman. We have a test dataset of 10 records with expected outcomes and a set of predictions from our classification algorithm. Let's start off and calculate the classification accuracy for this set of predictions. The algorithm made 7 of the 10 predictions correct with an accuracy of 70%. First, we must calculate the number of correct predictions for each class. Now, we can calculate the number of incorrect predictions for each class, organized by the predicted value.
Computer vision will play a crucial role in visual search, self-driving cars, medicine and many other applications. Success will hinge on collecting and labeling large labeled datasets which will be used to train and test new algorithms. One area that has seen great advances over the last five years is image classification i.e. determining automatically what objects are present in an image. Existing image classification datasets have an equal number of images for each class. However, the real world is long tailed: only a small percentage of classes are likely to be observed; most classes are infrequent or rare.