Su, Meng
Improved Naive Bayes with Mislabeled Data
Zeng, Qianhan, Zhu, Yingqiu, Zhu, Xuening, Wang, Feifei, Zhao, Weichen, Sun, Shuning, Su, Meng, Wang, Hansheng
Labeling mistakes are frequently encountered in real-world applications. If not handled properly, they can seriously degrade a model's classification performance. To address this issue, we propose an improved Naive Bayes method for text classification. It is analytically simple and requires no subjective judgement about which labels are correct or incorrect. By specifying the generating mechanism of incorrect labels, we optimize the corresponding log-likelihood function iteratively using an EM algorithm. Our simulation and experimental results show that the improved Naive Bayes method greatly improves on the performance of the standard Naive Bayes method when the data are mislabeled.
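The idea of fitting Naive Bayes by EM when the observed labels may be wrong can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not state their label-generating mechanism, so the sketch assumes a simple one in which each observed label is drawn from the true (latent) class through a class-dependent flip matrix `rho`. The binary word-presence features, the smoothing constant `alpha`, the initial value `init_keep`, the function name and the toy data are all assumptions made for illustration.

```python
import math

def em_noisy_naive_bayes(X, y, n_classes=2, n_iter=50, alpha=1.0, init_keep=0.8):
    """Bernoulli Naive Bayes fitted by EM, treating the true class as latent.

    X: list of binary feature lists; y: observed, possibly wrong, labels.
    Assumed noise model (hypothetical): P(observed = c | true = k) = rho[k][c].
    Requires n_classes >= 2.
    """
    n, d, K = len(X), len(X[0]), n_classes
    # Initialise as if the observed labels were correct (with smoothing).
    prior = [(sum(1 for v in y if v == k) + alpha) / (n + K * alpha) for k in range(K)]
    theta = []
    for k in range(K):
        rows = [X[i] for i in range(n) if y[i] == k]
        theta.append([(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                      for j in range(d)])
    off = (1.0 - init_keep) / (K - 1)
    rho = [[init_keep if a == b else off for b in range(K)] for a in range(K)]
    post = []
    for _ in range(n_iter):
        # E-step: posterior over the true class for every document.
        post = []
        for i in range(n):
            logp = []
            for k in range(K):
                s = math.log(prior[k]) + math.log(rho[k][y[i]])
                for j in range(d):
                    s += math.log(theta[k][j] if X[i][j] else 1.0 - theta[k][j])
                logp.append(s)
            m = max(logp)
            w = [math.exp(v - m) for v in logp]
            tot = sum(w)
            post.append([v / tot for v in w])
        # M-step: smoothed re-estimates weighted by the posteriors.
        for k in range(K):
            nk = sum(post[i][k] for i in range(n))
            prior[k] = (nk + alpha) / (n + K * alpha)
            theta[k] = [(sum(post[i][k] * X[i][j] for i in range(n)) + alpha)
                        / (nk + 2 * alpha) for j in range(d)]
            rho[k] = [(sum(post[i][k] for i in range(n) if y[i] == c) + alpha)
                      / (nk + K * alpha) for c in range(K)]
    return prior, theta, rho, post

# Toy data: two word groups; documents 3 and 7 carry labels that disagree
# with their word pattern.
X = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0],
     [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 1, 1]]
y = [0, 0, 0, 1, 1, 1, 1, 0]
prior, theta, rho, post = em_noisy_naive_bayes(X, y)
```

Each row of `post` is the estimated probability of every true class for one document, so a document can be reassigned by taking the largest entry rather than trusting its observed label. Starting `rho` close to the identity encodes the assumption that most labels are correct and helps keep the latent classes aligned with the observed ones.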
Applying Diffusion Distance for Multi-Scale Analysis of An Experience Space
Su, Meng (The Pennsylvania State University) | Fan, Xiaocong (The Pennsylvania State University) | Ge, WeiLi (Zhengzhou University)
Diffusion distance has been shown to be significantly more effective than Euclidean distance in multi-scale recognition of similar experiences in Recognition-Primed Decision making. In this paper, we first examine the experience data set used in the previous study. The visualization of the data set (using the first three dominant eigenvectors of the diffusion space) suggests the applicability of the diffusion approach. Second, we investigate two approaches to the computation of diffusion distance: Spectrum-based and Probability-Matching-based. Specifically, by the ‘Spectrum-based’ approach we refer to the one derived in terms of the eigenvalues/eigenvectors of the normalized diffusion matrix. We use the term ‘Probability-Matching’ to refer to the use of various probability distances, where the original L2 diffusion distance is treated as a special case. Our preliminary results indicate that the L2 diffusion distance performs at least as well as the Spectrum-based distance. Furthermore, when the Spectrum-based approach is applied, we have to use embedding and extension techniques to label new experience data, whereas no such recomputation is necessary when the L2 diffusion distance is used: we do not need to recompute the diffusion matrix, and hence the diffusion map, each time a new data point is added. This makes the L2 approach more natural and robust, especially for labeling a single new experience. The numerical examples also show an improvement in performance. We are currently working on several other Probability-Matching approaches (e.g., the Earth-Mover’s Distance).
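The L2 diffusion distance mentioned in the abstract can be illustrated with a small sketch. It assumes the commonly used construction: a Gaussian kernel on the data, row-normalized into a Markov (diffusion) matrix, with the distance between two points taken as the L2 distance between their t-step transition distributions, weighted by the inverse stationary distribution. The kernel width `eps`, the time scale `t`, the function names and the toy points are assumptions, not details taken from the paper.

```python
import math

def diffusion_matrix(points, eps=1.0):
    """Row-normalised Gaussian kernel and its stationary distribution."""
    n = len(points)
    K = [[math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j])) / eps)
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in K]
    P = [[K[i][j] / deg[i] for j in range(n)] for i in range(n)]
    total = sum(deg)
    pi = [dg / total for dg in deg]  # stationary distribution of P
    return P, pi

def matmul(A, B):
    """Plain dense matrix product for small square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(P, t):
    """P raised to the integer power t (t >= 1)."""
    result = P
    for _ in range(t - 1):
        result = matmul(result, P)
    return result

def l2_diffusion_distance(Pt, pi, i, j):
    """Weighted L2 distance between the t-step distributions of points i and j."""
    return math.sqrt(sum((Pt[i][z] - Pt[j][z]) ** 2 / pi[z] for z in range(len(pi))))

# Toy "experience space": two well-separated pairs of points.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
P, pi = diffusion_matrix(points, eps=1.0)
Pt = mat_power(P, 2)  # diffusion at time scale t = 2
d_same = l2_diffusion_distance(Pt, pi, 0, 1)
d_diff = l2_diffusion_distance(Pt, pi, 0, 2)
```

In this toy example the two points in the same pair are much closer in diffusion distance than points from different pairs, since their transition distributions almost coincide. Larger values of `t` merge progressively coarser groups, which is what makes the distance multi-scale. Substituting another probability distance for the weighted L2 form in `l2_diffusion_distance` would give a different Probability-Matching variant.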