Patrick Pantel and Dekang Lin
Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada R3T 2N2

Abstract

We present a simple yet highly accurate spam filtering program, called SpamCop, which identifies about 92% of the spams while misclassifying only about 1.16% of the nonspam emails. SpamCop treats an email message as a multiset of words and employs a naive Bayes algorithm to determine whether or not a message is likely to be a spam. Compared with keyword-spotting rules, the probabilistic approach taken in SpamCop not only offers high accuracy but also overcomes the brittleness of the keyword-spotting approach.

Introduction

With the explosive growth of the Internet has come a proliferation of spams. Spammers collect a plethora of email addresses without the consent of the owners of these addresses.
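To make the abstract's description concrete, the following is a minimal sketch of a naive Bayes classifier over a multiset of words. It is not the SpamCop implementation: the tokenizer, the add-one smoothing, the class labels "spam" and "ham", and the name `NaiveBayesSpamFilter` are all assumptions made for illustration, and the sketch assumes at least one training message per class.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the message as a multiset (Counter) of lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


class NaiveBayesSpamFilter:
    """Hypothetical multinomial naive Bayes filter (illustrative only)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing  # assumed add-one (Laplace) smoothing
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one labeled message ("spam" or "ham") to the counts."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def log_score(self, text, label):
        """Log prior of the class plus log likelihood of each word occurrence."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_messages = sum(self.message_counts.values())
        score = math.log(self.message_counts[label] / total_messages)
        denom = total_words + self.smoothing * len(vocab)
        for word, count in tokenize(text).items():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            p = (self.word_counts[label][word] + self.smoothing) / denom
            score += count * math.log(p)  # multiset: repeated words count again
        return score

    def is_spam(self, text):
        """Classify by comparing the two class scores."""
        return self.log_score(text, "spam") > self.log_score(text, "ham")
```

Because every word contributes a small amount of evidence, no single keyword decides the outcome, which is one way to read the abstract's contrast with brittle keyword-spotting rules.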
To investigate breast cancer prediction in the aging population with respect to the grades of DCIS, we train and test a tree-augmented naive Bayesian network on a large clinical dataset comprising consecutive diagnostic mammography examinations, the consequent biopsy outcomes, and related cancer registry records for women of all ages. The aggregated results of our ten-fold cross-validation support a biopsy threshold higher than 2% for the aging population.
This paper describes an effort to measure the effectiveness of tutor help in an intelligent tutoring system. Although conventional pre- and post-test experiments can determine whether tutor help is effective, they are expensive to conduct. Furthermore, pre- and post-test experiments often do not model student knowledge explicitly and thus ignore a source of information: students often request help about words they do not know. Therefore, we construct a dynamic Bayes net (which we call the Help model) that models tutor help and student knowledge in one coherent framework. The Help model distinguishes two different effects of help: scaffolding immediate performance vs. teaching persistent knowledge that improves long-term performance. We train the Help model to fit student performance data gathered from usage of the Reading Tutor (Mostow & Aist, 2001). The parameters of the trained model suggest that students benefit from both the scaffolding and teaching effects of help. That is, students are more likely to perform correctly on the current attempt and to learn persistent knowledge if tutor help is provided. Thus, our framework is able to distinguish two types of influence that tutor help has on the student, and can determine whether help helps learning without an explicit controlled study.
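The two effects of help can be illustrated with a heavily simplified two-state sketch: a hidden "knows the word" variable, an observed help flag, and an observed correct/incorrect outcome. This is not the authors' Help model and it does no parameter fitting. The function name `filter_knowledge`, the parameter names, and all numbers are hypothetical, and the sketch assumes that knowledge, once learned, is not forgotten.

```python
def filter_knowledge(observations, params, prior_known=0.3):
    """Track P(student knows the word) over a sequence of attempts.

    observations: list of (helped, correct) boolean pairs, one per attempt.
    params: dict of hypothetical probabilities, each keyed by the help flag:
        "correct_if_known"   -> P(correct | known, helped)
        "correct_if_unknown" -> P(correct | unknown, helped)  (scaffolding)
        "learn"              -> P(unknown becomes known | helped)  (teaching)
    Returns the probability of knowing the word after each attempt.
    """
    p_known = prior_known
    trajectory = []
    for helped, correct in observations:
        # Emission step: how likely is this outcome in each knowledge state?
        pc_known = params["correct_if_known"][helped]
        pc_unknown = params["correct_if_unknown"][helped]
        like_known = pc_known if correct else 1.0 - pc_known
        like_unknown = pc_unknown if correct else 1.0 - pc_unknown

        # Bayes update of the knowledge estimate given the outcome.
        numer = p_known * like_known
        posterior = numer / (numer + (1.0 - p_known) * like_unknown)

        # Transition step: an unknown word may be learned after the attempt.
        p_known = posterior + (1.0 - posterior) * params["learn"][helped]
        trajectory.append(p_known)
    return trajectory


# Hypothetical parameters: help raises both immediate success and learning.
example_params = {
    "correct_if_known": {False: 0.9, True: 0.95},
    "correct_if_unknown": {False: 0.2, True: 0.6},
    "learn": {False: 0.1, True: 0.25},
}
```

In this sketch the scaffolding effect lives in the emission probabilities and the teaching effect in the transition probability, so comparing the helped and unhelped entries of each separates the two kinds of influence.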
Cross-validation (CV) is a technique for evaluating the ability of statistical models/learning systems based on a given data set. Despite its wide applicability, its rather heavy computational cost can prevent its use as the system size grows. To resolve this difficulty in the case of Bayesian linear regression, we develop a formula for evaluating the leave-one-out CV error approximately without actually performing CV. The usefulness of the developed formula is tested by statistical mechanical analysis for a synthetic model, and is confirmed by application to a real-world supernova data set as well.
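The idea of obtaining the leave-one-out error without refitting can be illustrated in the simplest setting. The sketch below is not the formula developed in the paper. It assumes a Gaussian prior with a fixed regularization strength `lam` and no intercept, in which case the posterior mean is the ridge estimate and the classical hat-matrix identity gives the leave-one-out residuals exactly. The function names are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge estimate: solve (X^T X + lam I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def loo_error_bruteforce(X, y, lam):
    """Mean squared leave-one-out error, refitting the model n times."""
    n = X.shape[0]
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i          # drop sample i
        w = ridge_fit(X[mask], y[mask], lam)
        errors[i] = (y[i] - X[i] @ w) ** 2
    return errors.mean()


def loo_error_shortcut(X, y, lam):
    """Same quantity from a single fit via the hat matrix H.

    With H = X (X^T X + lam I)^{-1} X^T, the leave-one-out residual of
    sample i equals (y_i - yhat_i) / (1 - H_ii).
    """
    d = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    return np.mean(loo_residuals ** 2)
```

The brute-force version costs n separate fits while the shortcut needs one, which is the kind of saving the abstract is after; for priors where no exact identity of this sort is available, an approximate formula is needed instead.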