Within Machine Learning many tasks are - or can be reformulated as - classification tasks. In classification tasks we are trying to produce a model which can give the correlation between the input data and the class each input belongs to. This model is formed with the feature-values of the input-data. For example, the dataset contains datapoints belonging to the classes Apples, Pears and Oranges and based on the features of the datapoints (weight, color, size etc) we are trying to predict the class. We need some amount of training data to train the Classifier, i.e. form a correct model of the data.
In this paper, we examine previous work on the naive Bayesian classifier and review its limitations, which include a sensitivity to correlated features. We respond to this problem by embedding the naive Bayesian induction scheme within an algorithm that c arries out a greedy search through the space of features. We hypothesize that this approach will improve asymptotic accuracy in domains that involve correlated features without reducing the rate of learning in ones that do not. We report experimental results on six natural domains, including comparisons with decision-tree induction, that support these hypotheses. In closing, we discuss other approaches to extending naive Bayesian classifiers and outline some directions for future research.
When modeling a probability distribution with a Bayesian network, we are faced with the problem of how to handle continuous variables. Most previous work has either solved the problem by discretizing, or assumed that the data are generated by a single Gaussian. In this paper we abandon the normality assumption and instead use statistical methods for nonparametric density estimation. For a naive Bayesian classifier, we present experimental results on a variety of natural and artificial domains, comparing two methods of density estimation: assuming normality and modeling each conditional distribution with a single Gaussian; and using nonparametric kernel density estimation. We observe large reductions in error on several natural and artificial data sets, which suggests that kernel estimation is a useful tool for learning Bayesian models.
Addressing the problem of spam emails in the Internet, this paper presents a comparative study on Na\"ive Bayes and Artificial Neural Networks (ANN) based modeling of spammer behavior. Keyword-based spam email filtering techniques fall short to model spammer behavior as the spammer constantly changes tactics to circumvent these filters. The evasive tactics that the spammer uses are themselves patterns that can be modeled to combat spam. It has been observed that both Na\"ive Bayes and ANN are best suitable for modeling spammer common patterns. Experimental results demonstrate that both of them achieve a promising detection rate of around 92%, which is considerably an improvement of performance compared to the keyword-based contemporary filtering approaches.
In a content based image classification system, target images are sorted by feature similarities with respect to the query (CBIR). In this paper, we propose to use new approach combining distance tangent, k-means algorithm and Bayesian network for image classification. First, we use the technique of tangent distance to calculate several tangent spaces representing the same image. The objective is to reduce the error in the classification phase. Second, we cut the image in a whole of blocks. For each block, we compute a vector of descriptors. Then, we use K-means to cluster the low-level features including color and texture information to build a vector of labels for each image. Finally, we apply five variants of Bayesian networks classifiers (Na\"ive Bayes, Global Tree Augmented Na\"ive Bayes (GTAN), Global Forest Augmented Na\"ive Bayes (GFAN), Tree Augmented Na\"ive Bayes for each class (TAN), and Forest Augmented Na\"ive Bayes for each class (FAN) to classify the image of faces using the vector of labels. In order to validate the feasibility and effectively, we compare the results of GFAN to FAN and to the others classifiers (NB, GTAN, TAN). The results demonstrate FAN outperforms than GFAN, NB, GTAN and TAN in the overall classification accuracy.