CAST: Cluster-Aware Self-Training for Tabular Data

Kim, Minwook, Kim, Juseong, Kim, Ki Beom, Song, Giltae

Feb-2-2024–arXiv.org Artificial Intelligence

Self-training has gained traction because of its simplicity and versatility, yet it is vulnerable to noisy pseudo-labels caused by erroneous confidence. Several solutions have been proposed to handle this problem, but they require significant modifications to the self-training algorithm or the model architecture, and most have limited applicability in tabular domains. To address this issue, we explore a novel direction of reliable confidence in self-training contexts and conclude that the confidence, which represents the value of the pseudo-label, should be aware of the cluster assumption. In this regard, we propose Cluster-Aware Self-Training (CAST) for tabular data, which enhances existing self-training algorithms at a negligible cost without significant modifications. Concretely, CAST regularizes the confidence of the classifier by leveraging local density for each class in the labeled training data, forcing the pseudo-labels in low-density regions to have lower confidence. Extensive empirical evaluations on up to 21 real-world datasets confirm not only the superior performance of CAST but also its robustness in various setups in self-training contexts.

Self-training is an iterative algorithm that trains a classifier using a pseudo-labeling procedure: in each iteration, it assigns pseudo-labels to unlabeled data and then uses them as labeled data. It is a simple and versatile semi-supervised learning method because it employs the same training procedure as supervised learning, except that pseudo-labels are integrated into the training data. It is therefore particularly useful for practitioners in tabular domains, where the dominant architectures are gradient boosting decision trees (GBDTs), which are provided as complete frameworks that do not allow any changes to the training procedure [28; 8; 50].
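The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy nearest-centroid learner `fit`, the one-dimensional features, the function `self_train`, and the parameters `threshold` and `max_iters` are all assumptions introduced here. Any supervised learner that returns a label and a confidence could take the place of `fit`, and it is retrained unchanged in every round.

```python
import math
from typing import Callable, List, Tuple

Predictor = Callable[[float], Tuple[int, float]]  # x -> (label, confidence)


def fit(xs: List[float], ys: List[int]) -> Predictor:
    """Toy supervised learner (nearest centroid); a hypothetical stand-in for e.g. a GBDT."""
    centroids = {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
                 for c in set(ys)}

    def predict(x: float) -> Tuple[int, float]:
        # Softmax over negative distances gives a probability-like confidence.
        scores = {c: math.exp(-abs(x - m)) for c, m in centroids.items()}
        label = max(scores, key=scores.get)
        return label, scores[label] / sum(scores.values())

    return predict


def self_train(xs, ys, unlabeled, threshold=0.8, max_iters=5):
    """Repeatedly pseudo-label confident samples and retrain on the enlarged set."""
    xs, ys, pool = list(xs), list(ys), list(unlabeled)
    model = fit(xs, ys)
    for _ in range(max_iters):
        preds = [(x, *model(x)) for x in pool]
        accepted = [(x, y) for x, y, conf in preds if conf >= threshold]
        if not accepted:
            break  # nothing confident enough is left to pseudo-label
        xs += [x for x, _ in accepted]
        ys += [y for _, y in accepted]
        pool = [x for x, _, conf in preds if conf < threshold]
        model = fit(xs, ys)  # same supervised training call, larger training set
    return model, pool


model, leftover = self_train([0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1],
                             [0.5, 2.0, 5.2, 8.0, 9.5])
```

In this toy run, the sample at 5.2 lies between the two classes, never reaches the assumed threshold, and therefore stays in `leftover` without a pseudo-label.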
Contemporary self-training methods treat the confidence, usually the prediction probability of the classifier, as a score and generate a pseudo-label when that score is higher than or equal to a certain threshold [63; 45]. However, confidence is not always a reliable metric in real-world scenarios, for reasons such as biased classifiers or overconfidence in neural networks [22]. Erroneous confidence scores can produce noisy pseudo-labels during the self-training iterations, which may introduce confirmation bias that undermines the final self-training performance [3]. Given these pitfalls, relying solely on the confidence may be a precarious choice [72; 47; 64]. Several studies have sought to correct erroneous confidence by calibrating it to reflect the likelihood that the prediction is actually correct [22].
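To make concrete the abstract's idea that pseudo-labels in low-density regions should receive lower confidence, the sketch below scales a classifier's confidence by a per-class local density estimated from the labeled data. It is only an illustration under stated assumptions and should not be read as CAST's actual regularization formula: the Gaussian kernel, the `bandwidth` parameter, the normalization by the densest labeled point, and the names `class_density` and `regularized_confidence` are all hypothetical choices made here.

```python
import math
from typing import Dict, List


def class_density(x: float, points: List[float], bandwidth: float = 1.0) -> float:
    """Unnormalized Gaussian kernel density of x w.r.t. one class's labeled points."""
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / len(points)


def regularized_confidence(x: float, label: int, confidence: float,
                           labeled_by_class: Dict[int, List[float]],
                           bandwidth: float = 1.0) -> float:
    """Scale the confidence by how dense the predicted class is around x.

    Assumption: density is normalized by the densest labeled point of that
    class, so the weight lies in (0, 1] and can only lower the confidence.
    """
    points = labeled_by_class[label]
    reference = max(class_density(p, points, bandwidth) for p in points)
    weight = min(1.0, class_density(x, points, bandwidth) / reference)
    return confidence * weight


labeled_by_class = {0: [0.0, 0.5, 1.0], 1: [9.0, 9.5, 10.0]}
near = regularized_confidence(0.6, 0, 0.95, labeled_by_class)  # inside the class-0 cluster
far = regularized_confidence(4.0, 0, 0.95, labeled_by_class)   # far from any class-0 point
```

Both samples start with the same raw confidence of 0.95, but only `near` would still pass a threshold such as 0.8 after scaling. The sample far from the labeled class-0 cluster is pushed well below it, so an overconfident prediction in a sparse region would no longer become a pseudo-label.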


