I am trying to work out a way to accurately classify documents into three categories. The documents are rather lengthy (usually several thousand words), unstructured, and consist almost entirely of full sentences. Some keywords increase the probability that a document belongs to a particular category, but not all of them are known. So far I have cleaned the documents by removing punctuation, common stop words, and non-alphabetical strings. Since only a small part of the text is relevant, I am planning to use tf-idf to identify the significant words within each document.
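To make the idea concrete, here is a minimal sketch of the cleaning step followed by tf-idf weighting, using only the Python standard library. The stop-word list, the sample documents, and the helper names (`clean`, `tfidf`, `top_terms`) are placeholders I made up for illustration, and the weighting is the plain `tf * log(N / df)` variant; a library such as scikit-learn's `TfidfVectorizer` uses a smoothed formula and would give different numbers.

```python
import math
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for", "on", "with"}

def clean(text):
    """Lowercase, strip surrounding punctuation, drop stop words and
    any token that is not purely alphabetical."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if word.isalpha() and word not in STOP_WORDS:
            tokens.append(word)
    return tokens

def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / df), df = number of documents containing the term
    """
    tokenized = [clean(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights

def top_terms(weight, k=3):
    """The k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(weight.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

# Made-up sample documents, one per hypothetical category.
docs = [
    "The contract is terminated and the tenant pays damages.",
    "The patient is treated with antibiotics for the infection.",
    "The tenant and the patient signed the contract.",
]
weights = tfidf(docs)
for w in weights:
    print(top_terms(w))
```

Terms that occur in every document get a weight of zero under this formula, which is the property that should push the generic vocabulary down and let category-specific keywords surface. The resulting per-document weights could then be used as features for whatever classifier is chosen.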