Bridging Cost-sensitive and Neyman-Pearson Paradigms for Asymmetric Binary Classification
Li, Wei Vivian, Tong, Xin, Li, Jingyi Jessica
Asymmetric binary classification problems, in which the type I and II errors have unequal severity, are ubiquitous in real-world applications. To handle such asymmetry, researchers have developed the cost-sensitive and Neyman-Pearson paradigms for training classifiers to control the more severe type of classification error, say the type I error. The cost-sensitive paradigm is widely used and has straightforward implementations that do not require sample splitting; however, it demands an explicit specification of the costs of the type I and II errors, and it remains an open question which specification can guarantee high-probability control of the population type I error. In contrast, the Neyman-Pearson paradigm can train classifiers that achieve high-probability control of the population type I error, but it relies on sample splitting, which reduces the effective training sample size. Since the two paradigms have complementary strengths, it is reasonable to combine them for classifier construction. In this work, we study, for the first time, the methodological connections between the two paradigms, and we develop the TUBE-CS algorithm to bridge the two paradigms from the perspective of controlling the population type I error.
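To make "high-probability control of the population type I error" concrete, here is a minimal Python sketch of one common way a Neyman-Pearson threshold can be chosen from a held-out class-0 sample, using an order-statistic rule with a binomial tail bound. It is not the TUBE-CS algorithm, whose details the abstract does not give. The function names, the `alpha` (target type I error) and `delta` (tolerated violation probability) parameters, and the convention that a point is labeled class 1 when its score exceeds the threshold are all assumptions made for illustration.

```python
from math import comb


def np_threshold_rank(n, alpha, delta):
    """Return the smallest 1-indexed rank k such that using the k-th smallest
    held-out class-0 score as threshold has type I error > alpha with
    probability at most delta, or None if no rank qualifies.

    Assumes i.i.d. class-0 scores from a continuous distribution, so the
    violation probability is bounded by P(Binomial(n, 1 - alpha) >= k).
    """
    for k in range(1, n + 1):
        bound = sum(
            comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
            for j in range(k, n + 1)
        )
        if bound <= delta:
            return k
    return None


def np_threshold(held_out_class0_scores, alpha=0.05, delta=0.05):
    """Pick a score threshold from class-0 scores that were NOT used to fit
    the scoring function (this is where sample splitting comes in)."""
    scores = sorted(held_out_class0_scores)
    k = np_threshold_rank(len(scores), alpha, delta)
    if k is None:
        raise ValueError("too few held-out class-0 points for this alpha/delta")
    return scores[k - 1]  # classify as class 1 when score > threshold
```

Under these assumptions, the rule needs at least 59 held-out class-0 observations when `alpha = delta = 0.05`, since the bound at the largest rank is `(1 - alpha) ** n`. This illustrates the cost of sample splitting mentioned above: those observations cannot also be used to fit the scoring function.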
Curse of Dimensionality, and How to Manage It
Data scientists are often drawn to the profession by the chance to spend their days on cutting-edge research and development, working with fantastic new machine learning algorithms. While this is indeed a fun and exciting part of the job, most data scientists in the field will tell you that much of one's time is spent cleaning, transforming, and engineering the data. The common wisdom is that, given enough data, most standard algorithms will (eventually) detect the signal. This is the thesis that for large N, that is, when you have enough data points, all machine learning algorithms tend to converge on the same answer.