Imbalanced classification



Learning to Re-weight Examples with Optimal Transport for Imbalanced Classification

Neural Information Processing Systems

Imbalanced data pose challenges for deep learning based classification models. One of the most widely used approaches for tackling imbalanced data is re-weighting, where training samples are assigned different weights in the loss function. Most existing re-weighting approaches treat the example weights as learnable parameters and optimize them on the meta set, entailing expensive bilevel optimization. In this paper, we propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view. Specifically, we view the training set as an imbalanced distribution over its samples, which is transported by OT to a balanced distribution obtained from the meta set. The weights of the training samples are the probability mass of the imbalanced distribution and are learned by minimizing the OT distance between the two distributions. Compared with existing methods, our proposed one disengages the weight learning from the concerned classifier at each iteration. Experiments on image, text, and point cloud datasets demonstrate that our proposed re-weighting method has excellent performance, achieving state-of-the-art results in many cases and providing a promising tool for addressing the imbalanced classification issue.
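As a rough illustration of the distributional view, the sketch below derives per-sample weights from a semi-relaxed entropic OT plan: each point of a balanced meta set spreads its mass over training samples via a softmin of the squared Euclidean cost, and the induced training-side marginal serves as the weight vector. The function name and the one-step softmin solver are illustrative simplifications, not the paper's actual optimization.

```python
import numpy as np

def ot_reweight(train_feats, meta_feats, reg=0.1):
    """Semi-relaxed entropic OT sketch (illustrative, not the paper's method):
    each balanced meta sample distributes its mass over training samples
    by a softmin of the cost; the training-side marginal gives weights."""
    # cost matrix: meta rows x train columns (squared Euclidean distance)
    C = ((meta_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    P = np.exp(-C / reg)                 # entropic kernel
    P /= P.sum(axis=1, keepdims=True)    # each meta row spreads mass summing to 1
    P *= 1.0 / len(meta_feats)           # balanced (uniform) meta mass
    w = P.sum(axis=0)                    # induced training-side marginal
    return w / w.sum()                   # normalized per-sample weights

# imbalanced training set: 8 majority samples near 0, 2 minority near 5
X_train = np.array([[0.0]] * 8 + [[5.0]] * 2)
X_meta = np.array([[0.0], [5.0]])        # balanced meta set
w = ot_reweight(X_train, X_meta)
```

Minority samples end up with larger weight because half of the balanced meta mass flows to only two training points.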


Bias-Corrected Data Synthesis for Imbalanced Learning

Lyu, Pengfei, Ma, Zhengchi, Zhang, Linjun, Zhang, Anru R.

arXiv.org Machine Learning

Imbalanced data, where the positive samples represent only a small proportion compared to the negative samples, makes it challenging for classification problems to balance the false positive and false negative rates. A common approach to addressing the challenge involves generating synthetic data for the minority group and then training classification models with both observed and synthetic data. However, since the synthetic data depends on the observed data and fails to replicate the original data distribution accurately, prediction accuracy is reduced when the synthetic data is naively treated as the true data. In this paper, we address the bias introduced by synthetic data and provide consistent estimators for this bias by borrowing information from the majority group. We propose a bias correction procedure to mitigate the adverse effects of synthetic data, enhancing prediction accuracy while avoiding overfitting. This procedure is extended to broader scenarios with imbalanced data, such as imbalanced multi-task learning and causal inference. Theoretical properties, including bounds on bias estimation errors and improvements in prediction accuracy, are provided. Simulation results and data analysis on handwritten digit datasets demonstrate the effectiveness of our method.
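For context, the synthetic-data setup the paper builds on can be sketched with SMOTE-style interpolation between minority-class neighbours. This shows only the data-generation step, not the paper's bias-correction estimator, and `interpolate_minority` is a hypothetical helper name.

```python
import numpy as np

def interpolate_minority(X_min, n_new, k=3, rng=None):
    """SMOTE-style sketch: synthesize minority samples by interpolating
    between a minority point and one of its k nearest minority neighbours.
    (Illustrates the synthetic-data setup only; the bias-correction step,
    which borrows information from the majority class, is not shown.)"""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise squared distances within the minority class
    D = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)          # exclude self-matches
    nbrs = np.argsort(D, axis=1)[:, :k]  # k nearest neighbours per point
    out = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = nbrs[i, rng.integers(k)]
        lam = rng.random()               # interpolation coefficient in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = interpolate_minority(X_min, n_new=10, rng=0)
```

Because each synthetic point is a convex combination of two observed minority points, the generated data stay inside the minority class's convex hull, which is exactly the dependence on observed data that the paper's correction addresses.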



EIoU-EMC: A Novel Loss for Domain-specific Nested Entity Recognition

Zhang, Jian, Zhang, Tianqing, Li, Qi, Wang, Hongwei

arXiv.org Artificial Intelligence

In recent years, research has mainly focused on the general NER task, but challenges remain for the nested NER task in specific domains. In particular, low-resource and class-imbalance scenarios impede wide application in the biomedical and industrial domains. In this study, we design a novel loss, EIoU-EMC, by enhancing the implementation of the Intersection over Union loss and the multi-class loss. Our proposed method specifically leverages information about entity boundaries and entity classification, thereby enhancing the model's capacity to learn from a limited number of data samples. To validate the performance of this method on the NER task, we conducted experiments on three distinct biomedical NER datasets and one dataset we constructed from industrial complex-equipment maintenance documents. Compared to strong baselines, our method demonstrates competitive performance across all datasets. In our experimental analysis, the proposed method exhibits significant improvements in entity boundary recognition and entity classification. Our code is available here.
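The Intersection over Union ingredient of such a boundary-aware loss can be illustrated for token spans; the sketch below shows only that generic ingredient, not the EIoU-EMC combination itself.

```python
def span_iou(pred, gold):
    """IoU between two token spans (start, end), end-inclusive, as used by
    boundary-aware objectives for (nested) NER: 1.0 for identical spans,
    0.0 for disjoint ones, and a graded overlap score in between."""
    s1, e1 = pred
    s2, e2 = gold
    inter = max(0, min(e1, e2) - max(s1, s2) + 1)   # overlapping tokens
    union = (e1 - s1 + 1) + (e2 - s2 + 1) - inter   # tokens covered by either
    return inter / union
```

A loss term like `1 - span_iou(pred, gold)` rewards predictions whose boundaries are close to the gold span even when they are not exactly right, which is what helps in low-resource settings.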


A binary PSO based ensemble under-sampling model for rebalancing imbalanced training data

Li, Jinyan, Wu, Yaoyang, Fong, Simon, Tallón-Ballesteros, Antonio J., Yang, Xin-she, Mohammed, Sabah, Wu, Feng

arXiv.org Artificial Intelligence

Ensemble techniques and under-sampling techniques are both effective tools for imbalanced-dataset classification problems. In this paper, we propose a novel ensemble method that combines the advantages of ensemble learning for biasing classifiers with a new under-sampling method. The under-sampling method, named Binary PSO instance selection, works together with ensemble classifiers to find the most suitable size and combination of majority-class samples to build a new dataset together with the minority-class samples. The proposed method adopts a multi-objective strategy; its contribution is a notable improvement in imbalanced-classification performance while preserving the integrity of the original dataset as much as possible. We evaluated the proposed method and compared its performance on imbalanced datasets with several conventional basic ensemble methods. Experiments were also conducted on these imbalanced datasets using an improved version in which ensemble classifiers are wrapped in the Binary PSO instance selection. According to the experimental results, our proposed methods outperform single ensemble methods, state-of-the-art under-sampling methods, and combinations of these methods with the traditional PSO instance selection algorithm.
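A minimal sketch of binary PSO instance selection follows, assuming a deliberately simple fitness (match the selected majority count to the minority count) in place of the paper's ensemble-classifier scoring; all names and hyperparameters are illustrative.

```python
import numpy as np

def binary_pso_select(n_major, n_minor, n_particles=20, iters=50, seed=0):
    """Binary PSO sketch: each particle is a 0/1 mask over majority-class
    samples; bits are resampled from a sigmoid of the velocity. The fitness
    here only rewards balanced counts, whereas the paper scores candidate
    subsets with ensemble classifiers."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_particles, n_major)).astype(float)
    V = rng.normal(0.0, 1.0, size=(n_particles, n_major))

    def fitness(mask):
        return -abs(mask.sum() - n_minor)   # best (0) when counts match

    pbest, pbest_fit = X.copy(), np.array([fitness(x) for x in X])
    g = pbest[np.argmax(pbest_fit)].copy()  # global best mask
    w, c1, c2 = 0.7, 1.5, 1.5               # inertia and attraction weights
    for _ in range(iters):
        r1, r2 = rng.random(V.shape), rng.random(V.shape)
        V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (g - X)
        X = (rng.random(V.shape) < 1.0 / (1.0 + np.exp(-V))).astype(float)
        fit = np.array([fitness(x) for x in X])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = X[improved], fit[improved]
        g = pbest[np.argmax(pbest_fit)].copy()
    return g.astype(bool)                   # selection mask over majority set

mask = binary_pso_select(n_major=30, n_minor=10)
```

The returned mask indexes the retained majority samples, which would then be combined with all minority samples to form the rebalanced training set.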



Optimal Downsampling for Imbalanced Classification with Generalized Linear Models

Chen, Yan, Blanchet, Jose, Dembczynski, Krzysztof, Nern, Laura Fee, Flores, Aaron

arXiv.org Machine Learning

Downsampling, or under-sampling, is a technique used for large and highly imbalanced classification problems. We study optimal downsampling for imbalanced classification using generalized linear models (GLMs). We propose a pseudo maximum likelihood estimator and study its asymptotic normality in the context of increasingly imbalanced populations relative to an increasingly large sample size. We provide theoretical guarantees for the introduced estimator. Additionally, we compute the optimal downsampling rate using a criterion that balances statistical accuracy and computational efficiency. Our numerical experiments, conducted on both synthetic and empirical data, further validate our theoretical results and demonstrate that the introduced estimator outperforms commonly available alternatives.
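The downsampling setup can be illustrated with the classical intercept correction for logistic regression under outcome-dependent sampling: keeping each negative with probability r shifts the fitted intercept by ln(1/r), which can be undone after fitting. This is a standard textbook sketch, not the paper's pseudo maximum likelihood estimator or its optimal-rate criterion.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, iters=3000):
    """Plain gradient-ascent logistic regression (intercept + slopes)."""
    Xb = np.c_[np.ones(len(X)), X]
    beta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        beta += lr * Xb.T @ (y - p) / len(y)   # mean log-likelihood gradient
    return beta

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
true_b0, true_b1 = -3.0, 1.0                   # rare positives
p = 1.0 / (1.0 + np.exp(-(true_b0 + true_b1 * x)))
y = rng.random(n) < p

r = 0.1                                        # keep 10% of the negatives
keep = y | (rng.random(n) < r)
b0_fit, b1_fit = fit_logistic(x[keep, None], y[keep].astype(float))
b0_corrected = b0_fit + np.log(r)              # undo the sampling shift
```

Because the logistic model is closed under this outcome-dependent sampling, only the intercept needs the ln(r) offset; the slope estimate is unaffected.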


When resampling/reweighting improves feature learning in imbalanced classification?: A toy-model study

Obuchi, Tomoyuki, Tanaka, Toshiyuki

arXiv.org Machine Learning

Classifiers applied to such datasets tend to perform poorly for minority classes, which poses a major challenge in areas such as visual recognition. Although several methods to mitigate class imbalance have been proposed so far [6, 7, 8], recent advances in deep learning have shed new light on this issue, resulting in numerous studies applying those approaches to classifiers based on deep neural networks (DNNs) [5, 9, 10, 11, 12, 13, 1, 2, 14, 15, 16, 17]. Among the approaches proposed so far, we focus on two simple strategies, reweighting and resampling, which are commonly employed to mitigate class imbalance. The resampling strategy tries to balance the samples in the dataset by oversampling the minority classes and/or undersampling the majority classes, while the reweighting strategy assigns an additional weight to each term of the loss in order to counteract the class imbalance. The effectiveness of these strategies has been empirically verified in a wide range of studies [13, 1, 2, 14, 6, 7]. Despite this body of work, a transparent description or understanding of when they are useful is still incomplete. In particular, how class imbalance affects the quality of feature learning is an important problem in the context of representation learning in DNNs, but a thorough understanding of this issue is still missing. Recently, [2] reported an interesting observation that feature learning becomes better if no resampling is applied. More specifically, on the basis of their extensive experiments on visual recognition tasks using DNNs, they reported that the best classification performance was achieved when the whole network was first trained without any resampling and then only the last output layer (the final classifier) was retrained with class-balanced resampling.
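The reweighting strategy described above can be sketched with standard inverse-frequency class weights, w_c = N/(K n_c), applied to each loss term; this is the common baseline form, not any specific paper's variant.

```python
import numpy as np

def class_weights(labels, n_classes):
    """Inverse-frequency reweighting: w_c = N / (K * n_c), so every class
    contributes equally to the expected loss regardless of its frequency."""
    counts = np.bincount(labels, minlength=n_classes)
    return len(labels) / (n_classes * counts)

def weighted_nll(log_probs, labels, weights):
    """Mean negative log-likelihood with each term scaled by its class weight."""
    per_sample = -log_probs[np.arange(len(labels)), labels]
    return float(np.mean(weights[labels] * per_sample))

labels = np.array([0, 0, 0, 1])              # 3:1 imbalance
w = class_weights(labels, n_classes=2)       # [2/3, 2]: minority up-weighted
log_probs = np.log(np.full((4, 2), 0.5))     # a uniform dummy classifier
loss = weighted_nll(log_probs, labels, w)
```

Note that the weights sum to N over the dataset (3 * 2/3 + 1 * 2 = 4), so reweighting changes the per-class balance of the loss without changing its overall scale.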


MPOFI: Multichannel Partially Observed Functional Modeling for Defect Classification with Imbalanced Dataset via Deep Metric Learning

Xie, Yukun, Du, Juan, Zhang, Chen

arXiv.org Machine Learning

In modern manufacturing, most products on a production line are conforming; the few nonconforming products exhibit different defect types. Identifying the defect types can help with further root-cause diagnosis of production lines. With the development of sensing, continuous signals of process variables can be collected at high resolution and can be regarded as multichannel functional data. These signals carry abundant information to characterize the process and help identify the defect types. Motivated by a real example from the pipe-tightening process, we target defect classification when each sample is multichannel functional data. However, the available samples for each defect type are limited and imbalanced. Moreover, the functions are partially observed, since the pre-tightening stage before the pipe-tightening process is unobserved. Classifying defect samples based on imbalanced, multichannel, and partially observed functional data is important but challenging. Thus, we propose an innovative framework known as "Multichannel Partially Observed Functional Modeling for Defect Classification with an Imbalanced Dataset" (MPOFI). The framework leverages the power of deep metric learning in conjunction with a neural network specially crafted for processing functional data. This paper introduces a neural network explicitly tailored to multichannel and partially observed functional data, complemented by a corresponding loss function for training on imbalanced datasets. The results from a real-world case study demonstrate the superior accuracy of our framework compared to existing benchmarks.
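Deep metric learning of the kind this framework leverages is typically built on objectives such as the triplet margin loss; the sketch below shows that generic ingredient only, not MPOFI's tailored loss or network.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss, a standard deep-metric-learning objective:
    pull same-class embeddings together and push different-class embeddings
    at least `margin` farther away, which helps when classes have few samples."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)   # same-class distance
    d_neg = np.linalg.norm(anchor - negative, axis=-1)   # cross-class distance
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())
```

Because the loss is built from pairwise distances rather than per-class decision boundaries, every minority sample contributes through many triplets, which is one reason metric learning is attractive for imbalanced defect data.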