Unsupervised or Indirectly Supervised Learning


Rebalancing the Scales: A Systematic Mapping Study of Generative Adversarial Networks (GANs) in Addressing Data Imbalance

arXiv.org Artificial Intelligence

Machine learning algorithms are used in diverse domains, many of which face significant challenges due to data imbalance. Studies have explored various approaches to address the issue, such as data preprocessing, cost-sensitive learning, and ensemble methods. Generative Adversarial Networks (GANs) have shown immense potential as a data preprocessing technique that generates good-quality synthetic data. This study employs a systematic mapping methodology to analyze 3041 papers on GAN-based sampling techniques for imbalanced data sourced from four digital libraries. A filtering process identified 100 key studies spanning domains such as healthcare, finance, and cybersecurity. Through comprehensive quantitative analysis, this research introduces three categorization mappings: application domains, GAN techniques, and GAN variants used to handle the imbalanced nature of the data. GAN-based over-sampling emerges as an effective preprocessing method. Advanced architectures and tailored frameworks have helped GANs improve further in cases of data imbalance. GAN variants such as vanilla GAN, CTGAN, and CGAN show great adaptability to structured imbalanced data. Interest in GANs for imbalanced data has grown tremendously, peaking in recent years, with journals and conferences playing crucial roles in transmitting foundational theories and practical applications. Despite these advances, none of the reviewed studies explicitly explores hybridized GAN frameworks with diffusion models or reinforcement learning techniques. This gap points to a future research direction: developing innovative approaches for effectively handling data imbalance.
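The over-sampling workflow the study surveys can be sketched as follows, assuming a generator already trained on the minority class; the `generator` callable and the toy Gaussian stand-in below are illustrative, not taken from any reviewed paper:

```python
import numpy as np

def gan_oversample(X, y, generator, minority_class):
    """Rebalance a dataset by drawing synthetic minority-class rows from a
    pre-trained GAN generator until both classes have equal size."""
    n_major = int(np.sum(y != minority_class))
    n_minor = int(np.sum(y == minority_class))
    n_needed = n_major - n_minor
    if n_needed <= 0:
        return X, y
    X_syn = generator(n_needed)                      # synthetic minority rows
    X_bal = np.vstack([X, X_syn])
    y_bal = np.concatenate([y, np.full(n_needed, minority_class)])
    return X_bal, y_bal

# Toy stand-in for a trained generator: Gaussian noise around the minority
# mean (a real CTGAN/CGAN would instead be trained on the minority rows).
X = np.vstack([np.zeros((90, 2)), np.ones((10, 2))])
y = np.array([0] * 90 + [1] * 10)
toy_generator = lambda n: 1.0 + 0.1 * np.random.default_rng(1).standard_normal((n, 2))

X_bal, y_bal = gan_oversample(X, y, toy_generator, minority_class=1)
print(np.bincount(y_bal))  # → [90 90]
```

The same loop works unchanged with any of the surveyed variants, since only the `generator` callable differs.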


Context-Aware Doubly-Robust Semi-Supervised Learning

arXiv.org Artificial Intelligence

The widespread adoption of artificial intelligence (AI) in next-generation communication systems is challenged by the heterogeneity of traffic and network conditions, which call for the use of highly contextual, site-specific data. A promising solution is to rely not only on real-world data, but also on synthetic pseudo-data generated by a network digital twin (NDT). However, the effectiveness of this approach hinges on the accuracy of the NDT, which can vary widely across different contexts. To address this problem, this paper introduces context-aware doubly-robust (CDR) learning, a novel semi-supervised scheme that adapts its reliance on the pseudo-data to the different levels of fidelity of the NDT across contexts. CDR is evaluated on the task of downlink beamforming, showing superior performance compared to previous state-of-the-art semi-supervised approaches.
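The abstract does not spell out the CDR estimator, but the generic doubly-robust idea it builds on can be sketched as below; the `alpha` fidelity weight and the mean-estimation setting are simplifying assumptions for illustration, not the paper's actual construction:

```python
import numpy as np

def cdr_estimate(y_real, f_real, f_pseudo, alpha):
    """Doubly-robust mean estimate: trust the model's mean on pseudo-data,
    corrected by the residual bias it leaves on real data. `alpha` in [0, 1]
    scales reliance on the pseudo-data according to (assumed) NDT fidelity."""
    model_term = np.mean(f_pseudo)            # pseudo-data (NDT) term
    correction = np.mean(y_real - f_real)     # real-data bias correction
    return alpha * (model_term + correction) + (1 - alpha) * np.mean(y_real)

rng = np.random.default_rng(0)
y_real = rng.normal(1.0, 0.1, 50)             # real observations, true mean 1.0
f_real = y_real + 0.3                         # model predictions, biased by +0.3
f_pseudo = rng.normal(1.3, 0.1, 500)          # model on pseudo-data (same bias)
print(cdr_estimate(y_real, f_real, f_pseudo, alpha=0.8))  # approximately 1.0
```

Even though the model is biased, the real-data correction cancels the bias carried into the pseudo-data term, which is the "doubly robust" property; a context-aware scheme would choose `alpha` per context.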


Confidence-Weighted Boundary-Aware Learning for Semi-Supervised Semantic Segmentation

arXiv.org Artificial Intelligence

Semi-supervised semantic segmentation (SSSS) aims to improve segmentation performance by utilising unlabeled data alongside limited labeled samples. Existing SSSS methods often face challenges such as coupling, where over-reliance on initial labeled data leads to suboptimal learning; confirmation bias, where incorrect predictions reinforce themselves repeatedly; and boundary blur caused by insufficient boundary-awareness and ambiguous edge information. To address these issues, we propose CW-BASS, a novel framework for SSSS. In order to mitigate the impact of incorrect predictions, we assign confidence weights to pseudo-labels. Additionally, we leverage boundary-delineation techniques, which, despite being extensively explored in weakly-supervised semantic segmentation (WSSS), remain under-explored in SSSS. Specifically, our approach: (1) reduces coupling through a confidence-weighted loss function that adjusts the influence of pseudo-labels based on their predicted confidence scores, (2) mitigates confirmation bias with a dynamic thresholding mechanism that learns to filter out pseudo-labels based on model performance, (3) resolves boundary blur with a boundary-aware module that enhances segmentation accuracy near object boundaries, and (4) reduces label noise with a confidence decay strategy that progressively refines pseudo-labels during training. Extensive experiments on the Pascal VOC 2012 and Cityscapes datasets demonstrate that our method achieves state-of-the-art performance. Moreover, using only 1/8 (12.5%) of the labeled data, our method achieves a mIoU of 75.81 on Pascal VOC 2012, highlighting its effectiveness in limited-label settings.
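A minimal sketch of the confidence-weighting and threshold-filtering ideas in points (1) and (2), not the authors' exact loss; the threshold `tau` here is fixed rather than learned, and the example operates per-sample rather than per-pixel:

```python
import numpy as np

def confidence_weighted_loss(teacher_probs, student_probs, tau):
    """Cross-entropy against hard pseudo-labels, masked by a confidence
    threshold `tau` and weighted by each pseudo-label's confidence."""
    conf = teacher_probs.max(axis=1)              # pseudo-label confidence
    pseudo = teacher_probs.argmax(axis=1)         # hard pseudo-labels
    mask = conf >= tau                            # threshold filtering
    if not mask.any():
        return 0.0
    ce = -np.log(student_probs[np.arange(len(pseudo)), pseudo] + 1e-12)
    return float(np.sum(conf[mask] * ce[mask]) / mask.sum())

teacher = np.array([[0.9, 0.1], [0.6, 0.4]])      # 2nd sample falls below tau
student = np.array([[0.8, 0.2], [0.5, 0.5]])
print(confidence_weighted_loss(teacher, student, tau=0.7))  # ≈ 0.2008
```

Only the first sample survives the threshold, and its cross-entropy is down-weighted by its 0.9 confidence; a decay strategy such as the paper's point (4) would shrink these weights over training.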


Enhancing Semi-supervised Learning with Noisy Zero-shot Pseudolabels

arXiv.org Artificial Intelligence

The growing scale of machine learning applications has made data labeling costs a critical bottleneck in deploying ML systems [1, 2, 3]. Semi-supervised learning (SSL) addresses this challenge by leveraging unlabeled data alongside limited labeled examples [4]. Traditional SSL approaches like pseudo-labeling and consistency regularization have demonstrated strong performance across domains, particularly in computer vision and natural language processing [5, 6, 4]. Recent advances in foundation models have enabled zero-shot inference on novel tasks without task-specific training [7, 8]. These models can generate predictions for unseen tasks by leveraging their pretrained knowledge, offering a promising direction for reducing labeling requirements. Several works have proposed integrating these zero-shot capabilities into SSL frameworks [9, 10]. Current approaches primarily use foundation models as teacher networks for generating pseudo-labels through inference, which requires complex model distillation and introduces additional training overhead.


Proper Learnability and the Role of Unlabeled Data

arXiv.org Machine Learning

Proper learning refers to the setting in which learners must emit predictors in the underlying hypothesis class $H$, and often leads to learners with simple algorithmic forms (e.g. empirical risk minimization (ERM), structural risk minimization (SRM)). The limitation of proper learning, however, is that there exist problems which can only be learned improperly, e.g. in multiclass classification. Thus, we ask: Under what assumptions on the hypothesis class or the information provided to the learner is a problem properly learnable? We first demonstrate that when the unlabeled data distribution is given, there always exists an optimal proper learner governed by distributional regularization, a randomized generalization of regularization. We refer to this setting as the distribution-fixed PAC model, and continue to evaluate the learner on its worst-case performance over all distributions. Our result holds for all metric loss functions and any finite learning problem (with no dependence on its size). Further, we demonstrate that sample complexities in the distribution-fixed PAC model can shrink by only a logarithmic factor from the classic PAC model, strongly refuting the role of unlabeled data in PAC learning (from a worst-case perspective). We complement this with impossibility results which obstruct any characterization of proper learnability in the realizable PAC model. First, we observe that there are problems whose proper learnability is logically undecidable, i.e., independent of the ZFC axioms. We then show that proper learnability is not a monotone property of the underlying hypothesis class, and that it is not a local property (in a precise sense). Our impossibility results all hold even for the fundamental setting of multiclass classification, and go through a reduction of EMX learning (Ben-David et al., 2019) to proper classification which may be of independent interest.


Supplementary Material: Learning Semantic-aware Normalization for Generative Adversarial Networks

Neural Information Processing Systems

Figure 1: Visualization of the semantics learned at different resolutions. We show 16 groups in each layer, with the resolution increasing from 8×8 to 256×256. The attention maps are obtained by averaging the feature maps in a group. It can be observed that features at low resolutions (i.e., 8×8 to 64×64) perform better at learning semantics (e.g., eyes, mouths, and hair). Figure 2 shows the semantic interpolation results: independent control over fine-grained semantics can be realized by interpolating in the latent space. Table 1: Comparison of baseline, random grouping, and semantic grouping (i.e., the proposed SGM) on LSUN CATS [26] in terms of FID. Table 2: Semantic-aware control conducted at different resolutions.


Robust Graph-Based Semi-Supervised Learning via $p$-Conductances

arXiv.org Artificial Intelligence

We study the problem of semi-supervised learning on graphs in the regime where data labels are scarce or possibly corrupted. We propose an approach called $p$-conductance learning that generalizes the $p$-Laplace and Poisson learning methods by introducing an objective reminiscent of $p$-Laplacian regularization and an affine relaxation of the label constraints. This leads to a family of probability measure mincut programs that balance sparse edge removal with accurate distribution separation. Our theoretical analysis connects these programs to well-known variational and probabilistic problems on graphs (including randomized cuts, effective resistance, and Wasserstein distance) and provides motivation for robustness when labels are diffused via the heat kernel. Computationally, we develop a semismooth Newton-conjugate gradient algorithm and extend it to incorporate class-size estimates when converting the continuous solutions into label assignments. Empirical results on computer vision and citation datasets demonstrate that our approach achieves state-of-the-art accuracy in low label-rate, corrupted-label, and partial-label regimes.
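For context, the p = 2 special case that p-conductance learning generalizes (classic Laplace learning) can be sketched as a harmonic extension on the graph; this is a textbook construction, not the paper's algorithm:

```python
import numpy as np

def laplace_learning(W, labeled_idx, labels, n_classes):
    """p = 2 graph semi-supervised learning: clamp labeled nodes to one-hot
    values and solve the harmonic equation L u = 0 on the unlabeled nodes."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                # combinatorial Laplacian
    U = np.ones(n, dtype=bool)
    U[labeled_idx] = False                        # unlabeled-node mask
    B = np.zeros((n, n_classes))
    B[labeled_idx, labels] = 1.0                  # one-hot boundary values
    # Solve L_UU u_U = -L_UL u_L  (harmonic extension of the labels)
    B[U] = np.linalg.solve(L[np.ix_(U, U)], -L[np.ix_(U, ~U)] @ B[~U])
    return B.argmax(axis=1)

# Path graph 0-1-2-3-4 with only the two endpoints labeled.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
print(laplace_learning(W, [0, 4], [0, 1], n_classes=2))  # → [0 0 0 1 1]
```

On the path graph the harmonic solution interpolates linearly between the two clamped endpoints, so each node takes the label of its nearer endpoint; the paper's p-conductance programs replace this quadratic objective with p-Laplacian-style regularization and relaxed label constraints for robustness at low label rates.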


Review for NeurIPS paper: Estimating the Effects of Continuous-valued Interventions using Generative Adversarial Networks

Neural Information Processing Systems

The paper studies the problem of estimating the effect of continuous treatment variables. The authors propose a GAN-based framework that learns the distribution of the unobserved counterfactuals. The reviewers found the theoretical contribution, as well as the simulations showing improvement over pre-existing benchmarks, satisfying. Estimating the effect of a treatment is a central problem in causal inference, and as such this paper could be of interest to the broader NeurIPS audience.


065e259a1d2d955e63b99aac6a3a3081-Paper-Conference.pdf

Neural Information Processing Systems

In the adversarial training framework of Carmon et al. (2019); Gowal et al. (2021), people use generated/real unlabeled data with pseudolabels to improve adversarial robustness. We provide statistical insights to explain why the artificially generated data improve adversarial training. In particular, we study how the attack strength and the quality of the unlabeled data affect adversarial robustness in this framework. Our results show that with a high-quality unlabeled data generator, adversarial training can benefit greatly from this framework under large attack strength, while a poor generator can still help to some extent. To make adaptations concerning the quality of the generated data, we propose an algorithm that performs online adjustment of the weight between the labeled real data and the generated data, aiming to optimize the adversarial risk. Numerical studies are conducted to verify our theories and show the effectiveness of the proposed algorithm.
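The online weight adjustment between real and generated data could be sketched with a generic multiplicative-weights update; this is an illustrative assumption, since the abstract does not give the paper's actual update rule:

```python
import numpy as np

def online_mixture_weight(real_losses, gen_losses, eta=0.5):
    """Multiplicative-weights update of the mixing weight on real data: each
    round, the weight shifts toward whichever data source (real vs. generated)
    currently yields the lower validation loss."""
    w = 0.5                                       # initial weight on real data
    history = [w]
    for l_real, l_gen in zip(real_losses, gen_losses):
        a = w * np.exp(-eta * l_real)             # discount real by its loss
        b = (1 - w) * np.exp(-eta * l_gen)        # discount generated likewise
        w = a / (a + b)
        history.append(w)
    return history

# A high-quality generator (consistently lower loss) steadily gains weight.
hist = online_mixture_weight([1.0] * 5, [0.2] * 5)
print(round(hist[-1], 3))  # → 0.119
```

With equal losses the weight stays at 0.5, so the scheme degrades gracefully when the generator and real data are equally informative.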


Review for NeurIPS paper: FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

Neural Information Processing Systems

Four knowledgeable reviewers support acceptance for the contributions. The reviewers find that (i) the proposed algorithm is simple and efficient; (ii) the empirical evaluation is very carefully designed, with an extensive ablation study; and (iii) the analysis of the augmentation strategy and sharpening also provides good insights. Therefore, I also recommend acceptance. However, please consider revising your paper to address all the concerns and comments from the reviewers.