Multiclass Classification via Class-Weighted Nearest Neighbors
Khim, Justin, Xu, Ziyu, Singh, Shashank
Classification is a fundamental problem in statistics and machine learning that arises throughout science and engineering. Scientific applications include identifying plant and animal species from body measurements, determining cancer types based on gene expression, and satellite image processing (Fisher, 1936, 1938; Khan et al., 2001; Lee et al., 2004); in modern engineering contexts, credit card fraud detection, handwritten digit recognition, word sense disambiguation, and object detection in images are all examples of classification tasks. These applications have brought two new challenges: multiclass classification with a potentially large number of classes, and imbalanced data. For example, online retailers have hundreds of thousands or millions of products, and they may want to categorize these products within a preexisting taxonomy based on product descriptions (Lin et al., 2018). While the number of classes alone makes the problem difficult, an added difficulty with text data is that it is usually highly imbalanced: a few classes may constitute a large fraction of the data, while many classes have only a few examples. In fact, Feldman (2019) notes that if the data follow the classical Zipf distribution for text data (Zipf, 1936), i.e., the class probabilities satisfy a power-law distribution, then up to 35% of seen examples may appear only once in the training data. Natural image data also appear to exhibit both many classes and imbalance (Salakhutdinov et al., 2011; Zhu et al., 2014). Focusing on the problem of imbalanced data, researchers have found that a few heuristics help "do better," and the most principled and studied of these is weighting. There are a number of forms of weighting; we consider the most basic, in which misclassifying an example incurs a loss equal to a weight that depends only on the example's class, and we refer to this method as class-weighting.
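To make the class-weighting idea above concrete, the following is a minimal sketch of a class-weighted 0-1 loss, in which each mistake costs the weight attached to the true class. The function names `class_weighted_error` and `inverse_frequency_weights` are hypothetical, and the inverse-frequency weighting is only a common heuristic assumed here for illustration; it is not necessarily the weighting the authors study.

```python
from collections import Counter


def class_weighted_error(y_true, y_pred, weights):
    """Average class-weighted 0-1 loss.

    A misclassified example whose true class is c contributes weights[c];
    a correctly classified example contributes 0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return total / len(y_true)


def inverse_frequency_weights(y_true):
    """Assumed heuristic: weight each class by n / (k * count), so that
    every class carries the same total weight in the sample."""
    counts = Counter(y_true)
    n, k = len(y_true), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


# Imbalanced toy sample: 8 examples of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
# A classifier that always predicts the majority class.
y_pred = ["a"] * 10

uniform = {"a": 1.0, "b": 1.0}
balanced = inverse_frequency_weights(y_true)

unweighted_error = class_weighted_error(y_true, y_pred, uniform)
weighted_error = class_weighted_error(y_true, y_pred, balanced)
```

With uniform weights the majority-class predictor looks fairly good, since it errs on only the two minority examples. Under the assumed inverse-frequency weights, those same two errors are weighted more heavily, so ignoring the minority class is penalized as much as guessing between two equally likely classes.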
Apr-9-2020