Goto

Collaborating Authors

 Unsupervised or Indirectly Supervised Learning


GeRA: Label-Efficient Geometrically Regularized Alignment

arXiv.org Artificial Intelligence

Pretrained unimodal encoders incorporate rich semantic information into embedding space structures. To be similarly informative, multi-modal encoders typically require massive amounts of paired data for alignment and training. We introduce a semi-supervised Geometrically Regularized Alignment (GeRA) method to align the embedding spaces of pretrained unimodal encoders in a label-efficient way. Our method leverages the manifold geometry of unpaired (unlabeled) data to improve alignment performance. To prevent distortions to local geometry during the alignment process --potentially disrupting semantic neighborhood structures and causing misalignment of unobserved pairs -- we introduce a geometric loss term. This term is built upon a diffusion operator that captures the local manifold geometry of the unimodal pretrained encoders. GeRA is modality-agnostic and thus can be used to align pretrained encoders from any data modalities. We provide empirical evidence to the effectiveness of our method in the domains of speech-text and image-text alignment. Our experiments demonstrate significant improvement in alignment quality compared to a variaty of leading baselines, especially with a small amount of paired data, using our proposed geometric regularization.


Wafer Map Defect Patterns Semi-Supervised Classification Using Latent Vector Representation

arXiv.org Artificial Intelligence

As the globalization of semiconductor design and manufacturing processes continues, the demand for defect detection during integrated circuit fabrication stages is becoming increasingly critical, playing a significant role in enhancing the yield of semiconductor products. Traditional wafer map defect pattern detection methods involve manual inspection using electron microscopes to collect sample images, which are then assessed by experts for defects. This approach is labor-intensive and inefficient. Consequently, there is a pressing need to develop a model capable of automatically detecting defects as an alternative to manual operations. In this paper, we propose a method that initially employs a pre-trained VAE model to obtain the fault distribution information of the wafer map. This information serves as guidance, combined with the original image set for semi-supervised model training. During the semi-supervised training, we utilize a teacher-student network for iterative learning. The model presented in this paper is validated on the benchmark dataset WM-811K wafer dataset. The experimental results demonstrate superior classification accuracy and detection performance compared to state-of-the-art models, fulfilling the requirements for industrial applications. Compared to the original architecture, we have achieved significant performance improvement.


Beyond Myopia: Learning from Positive and Unlabeled Data through Holistic Predictive Trends

arXiv.org Artificial Intelligence

Learning binary classifiers from positive and unlabeled data (PUL) is vital in many real-world applications, especially when verifying negative examples is difficult. Despite the impressive empirical performance of recent PUL methods, challenges like accumulated errors and increased estimation bias persist due to the absence of negative labels. In this paper, we unveil an intriguing yet long-overlooked observation in PUL: \textit{resampling the positive data in each training iteration to ensure a balanced distribution between positive and unlabeled examples results in strong early-stage performance. Furthermore, predictive trends for positive and negative classes display distinctly different patterns.} Specifically, the scores (output probability) of unlabeled negative examples consistently decrease, while those of unlabeled positive examples show largely chaotic trends. Instead of focusing on classification within individual time frames, we innovatively adopt a holistic approach, interpreting the scores of each example as a temporal point process (TPP). This reformulates the core problem of PUL as recognizing trends in these scores. We then propose a novel TPP-inspired measure for trend detection and prove its asymptotic unbiasedness in predicting changes. Notably, our method accomplishes PUL without requiring additional parameter tuning or prior assumptions, offering an alternative perspective for tackling this problem. Extensive experiments verify the superiority of our method, particularly in a highly imbalanced real-world setting, where it achieves improvements of up to $11.3\%$ in key metrics. The code is available at \href{https://github.com/wxr99/HolisticPU}{https://github.com/wxr99/HolisticPU}.


SemiReward: A General Reward Model for Semi-supervised Learning

arXiv.org Artificial Intelligence

Semi-supervised learning (SSL) has witnessed great progress with various improvements in the self-training framework with pseudo labeling. The main challenge is how to distinguish high-quality pseudo labels against the confirmation bias. However, existing pseudo-label selection strategies are limited to pre-defined schemes or complex hand-crafted policies specially designed for classification, failing to achieve high-quality labels, fast convergence, and task versatility simultaneously. To these ends, we propose a Semi-supervised Reward framework (SemiReward) that predicts reward scores to evaluate and filter out high-quality pseudo labels, which is pluggable to mainstream SSL methods in wide task types and scenarios. To mitigate confirmation bias, SemiReward is trained online in two stages with a generator model and subsampling strategy. With classification and regression tasks on 13 standard SSL benchmarks of three modalities, extensive experiments verify that SemiReward achieves significant performance gains and faster convergence speeds upon Pseudo Label, FlexMatch, and Free/SoftMatch.


Something for (almost) nothing: Improving deep ensemble calibration using unlabeled data

arXiv.org Artificial Intelligence

We present a method to improve the calibration of deep ensembles in the small training data regime in the presence of unlabeled data. Our approach is extremely simple to implement: given an unlabeled set, for each unlabeled data point, we simply fit a different randomly selected label with each ensemble member. We provide a theoretical analysis based on a PAC-Bayes bound which guarantees that if we fit such a labeling on unlabeled data, and the true labels on the training data, we obtain low negative log-likelihood and high ensemble diversity on testing samples. Empirically, through detailed experiments, we find that for low to moderately-sized training sets, our ensembles are more diverse and provide better calibration than standard ensembles, sometimes significantly.


Pool-Based Active Learning with Proper Topological Regions

arXiv.org Artificial Intelligence

In recent years, machine learning has found gainful applications in diverse domains, but it still has a heavy dependence on expensive labeled data: Advances in cheap computing and storage have made it easier to store and process large amounts of unlabeled data, but the labeling often needs to be done by humans or using costly tools. Therefore, there is a need to develop general domain-independent methods to learn models effectively from a large amount of unlabeled data at the disposal, along with a minimal amount of labeled data: this is the framework of semi-supervised learning. Active learning aims explicitly to detect the observations to be labeled to optimize the learning process and efficiently reduce the labeling cost. The primary assumption behind active learning is that machine learning algorithms could reach a higher level of performance while using a smaller number of training labels if they were allowed to choose the training dataset (Settles, 2009). The most common active learning approaches are pool-based methods (Lewis and Catlett, 1994) based on a set of unlabeled observations. First, some points are labeled to train a classification model, and then, at each iteration, we choose unlabeled examples to query based on the predictions of the current model and a predefined priority score. These approaches show their limitations in low-budget regime scenarios because they need a sufficient budget to learn a weak model (Pourahmadi et al., 2021). 1


Iterative Semi-Supervised Learning for Abdominal Organs and Tumor Segmentation

arXiv.org Artificial Intelligence

Deep-learning (DL) based methods are playing an important role in the task of abdominal organs and tumors segmentation in CT scans. However, the large requirements of annotated datasets heavily limit its development. The FLARE23 challenge provides a large-scale dataset with both partially and fully annotated data, which also focuses on both segmentation accuracy and computational efficiency. In this study, we propose to use the strategy of Semi-Supervised Learning (SSL) and iterative pseudo labeling to address FLARE23. Initially, a deep model (nn-UNet) trained on datasets with complete organ annotations (about 220 scans) generates pseudo labels for the whole dataset. These pseudo labels are then employed to train a more powerful segmentation model. Employing the FLARE23 dataset, our approach achieves an average DSC score of 89.63% for organs and 46.07% for tumors on online validation leaderboard. For organ segmentation, We obtain 0.9007\% DSC and 0.9493\% NSD. For tumor segmentation, we obtain 0.3785% DSC and 0.2842% NSD. Our code is available at https://github.com/USTguy/Flare23.


TiDAL: Learning Training Dynamics for Active Learning

arXiv.org Artificial Intelligence

Active learning (AL) aims to select the most useful data samples from an unlabeled data pool and annotate them to expand the labeled dataset under a limited budget. Especially, uncertainty-based methods choose the most uncertain samples, which are known to be effective in improving model performance. However, AL literature often overlooks training dynamics (TD), defined as the ever-changing model behavior during optimization via stochastic gradient descent, even though other areas of literature have empirically shown that TD provides important clues for measuring the sample uncertainty. In this paper, we propose a novel AL method, Training Dynamics for Active Learning (TiDAL), which leverages the TD to quantify uncertainties of unlabeled data. Since tracking the TD of all the large-scale unlabeled data is impractical, TiDAL utilizes an additional prediction module that learns the TD of labeled data. To further justify the design of TiDAL, we provide theoretical and empirical evidence to argue the usefulness of leveraging TD for AL. Experimental results show that our TiDAL achieves better or comparable performance on both balanced and imbalanced benchmark datasets compared to state-of-the-art AL methods, which estimate data uncertainty using only static information after model training.


Generative Semi-supervised Learning with Meta-Optimized Synthetic Samples

arXiv.org Machine Learning

Semi-supervised learning (SSL) is a promising approach for training deep classification models using labeled and unlabeled datasets. However, existing SSL methods rely on a large unlabeled dataset, which may not always be available in many real-world applications due to legal constraints (e.g., GDPR). In this paper, we investigate the research question: Can we train SSL models without real unlabeled datasets? Instead of using real unlabeled datasets, we propose an SSL method using synthetic datasets generated from generative foundation models trained on datasets containing millions of samples in diverse domains (e.g., ImageNet). Our main concepts are identifying synthetic samples that emulate unlabeled samples from generative foundation models and training classifiers using these synthetic samples. To achieve this, our method is formulated as an alternating optimization problem: (i) meta-learning of generative foundation models and (ii) SSL of classifiers using real labeled and synthetic unlabeled samples. For (i), we propose a meta-learning objective that optimizes latent variables to generate samples that resemble real labeled samples and minimize the validation loss. For (ii), we propose a simple unsupervised loss function that regularizes the feature extractors of classifiers to maximize the performance improvement obtained from synthetic samples. We confirm that our method outperforms baselines using generative foundation models on SSL. We also demonstrate that our methods outperform SSL using real unlabeled datasets in scenarios with extremely small amounts of labeled datasets. This suggests that synthetic samples have the potential to provide improvement gains more efficiently than real unlabeled data.


Understanding the Structure of QM7b and QM9 Quantum Mechanical Datasets Using Unsupervised Learning

arXiv.org Artificial Intelligence

This paper explores the internal structure of two quantum mechanics datasets (QM7b, QM9), composed of several thousands of organic molecules and described in terms of electronic properties. Understanding the structure and characteristics of this kind of data is important when predicting the atomic composition from the properties in inverse molecular designs. Intrinsic dimension analysis, clustering, and outlier detection methods were used in the study. They revealed that for both datasets the intrinsic dimensionality is several times smaller than the descriptive dimensions. The QM7b data is composed of well defined clusters related to atomic composition. The QM9 data consists of an outer region predominantly composed of outliers, and an inner core region that concentrates clustered, inliner objects. A significant relationship exists between the number of atoms in the molecule and its outlier/inner nature. Despite the structural differences, the predictability of variables of interest for inverse molecular design is high. This is exemplified with models estimating the number of atoms of the molecule from both the original properties, and from lower dimensional embedding spaces.