Text Classification
Walmart Competition: Trip Type Classification
They took the NYC Data Science Academy 12-week full-time data science bootcamp program from Sep. 23 to Dec. 18, 2015. The post was based on their fourth in-class project (due after the 8th week of the program). Walmart uses trip type classification to segment its shoppers and their store visits in order to improve the shopping experience. Walmart's trip types are created from a combination of existing customer insights and purchase history data. The purpose of the Kaggle competition is to use only the purchase data provided to derive Walmart's classification labels.
[Project] Document Classification • /r/MachineLearning
I am currently trying to work out a way to accurately classify documents into 3 different categories. The documents are rather lengthy, usually several thousand words, unstructured and pretty much entirely full sentences. There are some keywords that increase the probability of the document belonging to one particular category, but not all of them are known. Until now I have tried to clean the documents by getting rid of punctuation, common stop words and non-alphabetical strings. Since only a small part of the text is relevant, I was planning to try a tf-idf process to identify significant words within the documents.
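The cleaning and tf-idf steps described in this post can be sketched in plain Python. This is only an illustration under stated assumptions: the stop-word list, the sample documents, and the helper names `tokenize`, `tf_idf` and `top_terms` are hypothetical, and the weighting is the plain `tf * log(N / df)` variant rather than anything the poster specified.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}

def tokenize(text):
    """Lowercase, keep alphabetic strings only, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

def top_terms(term_weights, k=3):
    """The k terms with the highest tf-idf weight in one document."""
    return sorted(term_weights, key=term_weights.get, reverse=True)[:k]

# Made-up sample documents, for illustration only.
docs = [
    "The contract covers payment of the invoice and the invoice date.",
    "The patient record covers diagnosis and treatment.",
    "The contract and the patient record are stored in the archive.",
]
doc_weights = tf_idf(docs)
print(top_terms(doc_weights[0]))
```

Terms that occur in every document get a weight of zero, which is why tf-idf is a reasonable first filter when only a small part of each text is relevant.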
Text Classification & Sentiment Analysis tutorial / blog
Natural Language Processing (NLP) is a vast area of Computer Science that is concerned with the interaction between Computers and Human Language[1]. Within NLP many tasks are, or can be reformulated as, classification tasks. In classification tasks we are trying to produce a classification function which can give the correlation between a certain 'feature' and a class. This classifier first has to be trained with a training dataset, and then it can be used to actually classify documents. Training means that we have to determine its model parameters.
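To make "training means determining the model parameters" concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. The choice of Naive Bayes, the class name `NaiveBayes`, and the toy reviews are assumptions made for illustration; the tutorial excerpt does not name a specific model here. The parameters learned in `fit` are simply the class counts and the per-class word counts.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        # Model parameters: how often each class and each (class, word) pair occurs.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(count / n_docs)  # log prior
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                # Smoothed log likelihood of the word given the class.
                score += math.log(
                    (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training set, made up for illustration.
train_texts = [
    "great film loved it",
    "wonderful acting great plot",
    "terrible film hated it",
    "boring plot awful acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("great wonderful film"))
```

Once trained, the same object can classify unseen documents by calling `model.predict` on new text.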
Text Analysis 101; A Basic Understanding for Business Users: Document Classification - AYLIEN
The automatic classification of documents is an example of how Machine Learning (ML) and Natural Language Processing (NLP) can be leveraged to enable machines to better understand human language. By classifying text, we are aiming to assign one or more classes or categories to a document or piece of text, making it easier to manage and sort the documents. Manually categorizing and grouping text sources can be extremely laborious and time-consuming, especially for publishers, news sites, blogs or anyone who deals with a lot of content. Broadly speaking, there are two classes of ML techniques: supervised and unsupervised. In supervised methods, a model is created based on previous observations, i.e. a training set.
Classifications in R: Response Modeling/Credit Scoring/Credit Rating using Machine Learning Techniques
This article was written by Ariful Mondal. Ariful is a senior manager, data science and big data analytics consultant at Tata Consultancy Services. This is an attempt to showcase some worked-out examples of Machine Learning (ML) using German Credit Data. Though we have selected the credit scoring problem as a case study in this article, the same process will be applicable to a wide range of classification or regression problems such as "Response modeling", "Risk Management", "Attrition/Churn management", "Cross-Sell/Up-Sell", "Usage Patterns", "Net Present Value", "Life Time Value", "Predictive Maintenance and condition based monitoring", "Warranty", "Reliability", "Failure Prediction", "Image/Video Processing", "Crime", "Medical Experiments", "Hidden pattern recognition". The basic difference between traditional modeling and machine learning is that "in traditional modeling we intend to set up a modeling framework and try to establish relationships while in machine learning we allow the model to learn from the data by understanding the hidden patterns".
Lightweight Random Indexing for Polylingual Text Classification
Moreo Fernández, Alejandro, Esuli, Andrea, Sebastiani, Fabrizio
Multilingual Text Classification (MLTC) is a text classification task in which documents are written each in one among a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Polylingual Text Classification (PLTC). In PLTC, which is the focus of this paper, we assume (differently from CLTC) that for each language in L there is a representative set of training documents; PLTC consists of improving the accuracy of each of the |L| monolingual classifiers by also leveraging the training documents written in the other (|L| - 1) languages. The obvious solution, consisting of generating a single polylingual classifier from the juxtaposed monolingual vector spaces, is usually infeasible, since the dimensionality of the resulting vector space is roughly |L| times that of a monolingual one, and is thus often unmanageable. As a response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or are not always free to use. One machine-translation-free and dictionary-free method that, to the best of our knowledge, has never been applied to PLTC before, is Random Indexing (RI). We analyse RI in terms of space and time efficiency, and propose a particular configuration of it (that we dub Lightweight Random Indexing, LRI). By running experiments on two well known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform (both in terms of effectiveness and efficiency) a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines.
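As a rough illustration of the general Random Indexing idea mentioned in this abstract, the sketch below maps every term, whatever its language, to a sparse random vector of fixed size and sums those vectors to represent a document. It is a generic RI sketch, not the paper's LRI configuration: the values of `DIM` and `NONZERO`, the hash-based seeding, and the function names are all assumptions chosen for the example.

```python
import hashlib
import random
from collections import Counter

DIM = 64      # size of the reduced space (assumed value)
NONZERO = 4   # non-zero entries per index vector (assumed value)

def index_vector(term):
    """Sparse random index vector for a term, stored as {position: +1 or -1}."""
    # Seed from a stable hash so the same term always gets the same vector.
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    positions = rng.sample(range(DIM), NONZERO)
    return {pos: rng.choice((-1, 1)) for pos in positions}

def document_vector(tokens):
    """Sum of the index vectors of the document's terms, weighted by term count."""
    vec = [0.0] * DIM
    for term, count in Counter(tokens).items():
        for pos, sign in index_vector(term).items():
            vec[pos] += sign * count
    return vec

# Documents in different languages land in the same DIM-dimensional space,
# so the vocabulary sizes of the individual languages never add up.
english = document_vector("the market fell sharply".split())
italian = document_vector("il mercato è sceso bruscamente".split())
print(len(english), len(italian))
```

The point of the design is that the vector length stays at `DIM` no matter how many languages contribute terms, which sidesteps the |L|-fold growth of the juxtaposed vector space described above.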
Most popular kaggle competition solutions
Large Scale Hierarchical Text Classification is a document classification challenge to classify a given Wikipedia document into one of 325,056 categories. Wikipedia has created this very large dataset. The dataset is multi-class, multi-label and hierarchical. There are roughly 325,000 categories and about 2,400,000 documents. This challenge builds upon a series of successful challenges on large-scale hierarchical text classification. Demokritos will give more information on this dataset at http://lshtc.iit.demokritos.gr/
Text Classification in Microsoft's Azure Machine Learning Studio CrowdFlower
There are lots of great tools out there for building machine learning models and data processing pipelines. Most of these tools, like R, scikit-learn, spark.ml At CrowdFlower, we use many of these resources to varying degrees. However, we also recognize that many people will prefer to approach model building and deployment in a hands-on integrated environment supported by a graphical interface. To this end, we are pleased to showcase an end-to-end model construction process in Microsoft's Azure Machine Learning Studio.