Improving Naive Bayes for Regression with Optimised Artificial Surrogate Data

Michael Mayo, Eibe Frank

arXiv.org Artificial Intelligence 

The typical pipeline for a supervised machine learning project begins with the collection of a substantial sample of labelled examples, usually referred to as training data. Depending on whether the labels are continuous or categorical, the supervised learning task is known as regression or classification, respectively. Once the training data is sufficiently clean and complete, it is used directly to build a predictive model with the machine learning algorithm of choice. The predictive model is then used to label new, unlabelled examples. If the labels of these new examples are known to the user in advance (but are withheld from the learning algorithm), the predictive accuracy of the model can be evaluated, and different models can therefore be compared directly. In the usual case, the training data is "real", i.e. the model is learned directly from labelled examples that were collected specifically for that purpose. Quite frequently, however, the training data is modified after it is collected. For example, it is standard practice to remove outlier examples and normalise numeric values. Moreover, the machine learning algorithm itself may specify modifications to the training data.
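
The pipeline described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' method: it assumes a single numeric feature, uses ordinary least squares as a stand-in for "the machine learning algorithm of choice", and treats any label more than three standard deviations from the mean as an outlier. The helper names (remove_outliers, fit_normaliser, fit_line, rmse) and the toy data are hypothetical.

```python
import math
import statistics

def remove_outliers(rows, z_max=3.0):
    """Drop (x, y) rows whose label is more than z_max std devs from the mean label."""
    ys = [y for _, y in rows]
    mu, sd = statistics.mean(ys), statistics.pstdev(ys)
    if sd == 0:
        return list(rows)
    return [(x, y) for x, y in rows if abs(y - mu) / sd <= z_max]

def fit_normaliser(xs):
    """Return a function that z-scores a feature using statistics of the training data."""
    mu, sd = statistics.mean(xs), statistics.pstdev(xs)
    sd = sd or 1.0  # avoid division by zero for a constant feature
    return lambda x: (x - mu) / sd

def fit_line(xs, ys):
    """Ordinary least squares for one feature (stand-in learner, an assumption)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def rmse(model, xs, ys):
    """Root mean squared error of the model on held-out labelled examples."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys))

# Toy data (hypothetical): y = 2x + 1, plus one corrupted training label.
train = [(float(x), 2.0 * x + 1.0) for x in range(20)] + [(5.0, 500.0)]
test = [(x + 0.5, 2.0 * (x + 0.5) + 1.0) for x in range(10)]

# 1. Modify the training data: remove outliers, then normalise the feature.
clean = remove_outliers(train)
normalise = fit_normaliser([x for x, _ in clean])

# 2. Build the predictive model from the cleaned, normalised training data.
model = fit_line([normalise(x) for x, _ in clean], [y for _, y in clean])

# 3. Evaluate on examples whose labels were withheld from the learner.
test_rmse = rmse(model, [normalise(x) for x, _ in test], [y for _, y in test])
print(f"kept {len(clean)} of {len(train)} training rows, test RMSE = {test_rmse:.3g}")
```

Note that the normaliser is fitted on the training data only and then reused for the test examples, so no information from the withheld labels or features leaks into the model.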
