Let's step back first and frame our classification problem in Bayesian terms -- where we have a set of prior beliefs and update our beliefs as we observe and collect evidence. In statistics, everything revolves around hypotheses. We make a hypothesis (an informed guess) about how the world works, and then we go about collecting evidence to test that hypothesis (if you would like to know the details, I wrote a post about hypothesis testing here). Classification models can be framed as a hypothesis as well. Let's first write out the objective and variables of our classification problem: OK, so that's classification -- now let's examine classification through a Bayesian lens.
In this tutorial we will discuss about Naive Bayes text classifier. Naive Bayes is one of the simplest classifiers that one can use because of the simple mathematics that are involved and due to the fact that it is easy to code with every standard programming language including PHP, C#, JAVA etc. Update: The Datumbox Machine Learning Framework is now open-source and free to download. Note that some of the techniques described below are used on Datumbox's Text Analysis service and they power up our API. The Naive Bayes classifier is a simple probabilistic classifier which is based on Bayes theorem with strong and naïve independence assumptions. It is one of the most basic text classification techniques with various applications in email spam detection, personal email sorting, document categorization, sexually explicit content detection, language detection and sentiment detection.
Classification is a predictive modeling problem that involves assigning a label to a given input data sample. The problem of classification predictive modeling can be framed as calculating the conditional probability of a class label given a data sample. Bayes Theorem provides a principled way for calculating this conditional probability, although in practice requires an enormous number of samples (very large-sized dataset) and is computationally expensive. Instead, the calculation of Bayes Theorem can be simplified by making some assumptions, such as each input variable is independent of all other input variables. Although a dramatic and unrealistic assumption, this has the effect of making the calculations of the conditional probability tractable and results in an effective classification model referred to as Naive Bayes.
We introduce the notion of restricted Bayes optimal classifiers. These classifiers attempt to combine the flexibility of the generative approach to classification with the high accuracy associated with discriminative learning. They first create a model of the joint distribution over class labels and features. Instead of choosing the decision boundary induced directly from the model, they restrict the allowable types of decision boundaries and learn the one that minimizes the probability of misclassification relative to the estimated joint distribution. In this paper, we investigate two particular instantiations of this approach. The first uses a nonparametric density estimator -- Parzen Windows with Gaussian kernels -- and hyperplane decision boundaries. We show that the resulting classifier is asymptotically equivalent to a maximal margin hyperplane classifier, a highly successful discriminative classifier. We therefore provide an alternative justification for maximal margin hyperplane classifiers. The second instantiation uses a mixture of Gaussians as the estimated density; in experiments on real-world data, we show that this approach allows data with missing values to be handled in a principled manner, leading to improved performance over regular discriminative approaches.
Last summer, I was at a conference having lunch with Hal Daume III when we got to talking about how "Bayesian" can be a funny and ambiguous term. It seems like the definition should be straightforward: "following the work of English mathematician Rev. Thomas Bayes," perhaps, or even "uses Bayes' theorem." But many methods bearing the reverend's name or using his theorem aren't even considered "Bayesian" by his most religious followers. Why is it that Bayesian networks, for example, aren't considered… y'know… Bayesian? As I've read more outside the fields of machine learning and natural language processing -- from psychometrics and environmental biology to hackers who dabble in data science -- I've noticed three broad uses of the term "Bayesian."