Goto

Collaborating Authors

 pseudo-labeling


TransPL: VQ-Code Transition Matrices for Pseudo-Labeling of Time Series Unsupervised Domain Adaptation

Kim, Jaeho, Lee, Seulki

arXiv.org Artificial Intelligence

Unsupervised domain adaptation (UDA) for time series data remains a critical challenge in deep learning, with traditional pseudo-labeling strategies failing to capture temporal patterns and channel-wise shifts between domains, producing sub-optimal pseudo-labels. As such, we introduce TransPL, a novel approach that addresses these limitations by modeling the joint distribution $P(\mathbf{X}, y)$ of the source domain through code transition matrices, where the codes are derived from vector quantization (VQ) of time series patches. Our method constructs class- and channel-wise code transition matrices from the source domain and employs Bayes' rule for target domain adaptation, generating pseudo-labels based on channel-wise weighted class-conditional likelihoods. TransPL offers three key advantages: explicit modeling of temporal transitions and channel-wise shifts between different domains, versatility towards different UDA scenarios (e.g., weakly-supervised UDA), and explainable pseudo-label generation. We validate TransPL's effectiveness through extensive analysis on four time series UDA benchmarks and confirm that it consistently outperforms state-of-the-art pseudo-labeling methods by a strong margin (6.1% accuracy improvement, 4.9% F1 improvement), while providing interpretable insights into the domain adaptation process through its learned code transition matrices.


Achieving Generalization in Orchestrating GNSS Interference Monitoring Stations Through Pseudo-Labeling

Heublein, Lucas, Feigl, Tobias, Rügamer, Alexander, Ott, Felix

arXiv.org Artificial Intelligence

The accuracy of global navigation satellite system (GNSS) receivers is significantly compromised by interference from jamming devices. Consequently, the detection of these jammers are crucial to mitigating such interference signals. However, robust classification of interference using machine learning (ML) models is challenging due to the lack of labeled data in real-world environments. In this paper, we propose an ML approach that achieves high generalization in classifying interference through orchestrated monitoring stations deployed along highways. We present a semi-supervised approach coupled with an uncertainty-based voting mechanism by combining Monte Carlo and Deep Ensembles that effectively minimizes the requirement for labeled training samples to less than 5% of the dataset while improving adaptability across varying environments. Our method demonstrates strong performance when adapted from indoor environments to real-world scenarios.


A Review of Pseudo-Labeling for Computer Vision

Kage, Patrick, Rothenberger, Jay C., Andreadis, Pavlos, Diochnos, Dimitrios I.

arXiv.org Artificial Intelligence

Deep neural models have achieved state of the art performance on a wide range of problems in computer science, especially in computer vision. However, deep neural networks often require large datasets of labeled samples to generalize effectively, and an important area of active research is semi-supervised learning, which attempts to instead utilize large quantities of (easily acquired) unlabeled samples. One family of methods in this space is pseudo-labeling, a class of algorithms that use model outputs to assign labels to unlabeled samples which are then used as labeled samples during training. Such assigned labels, called pseudo-labels, are most commonly associated with the field of semi-supervised learning. In this work we explore a broader interpretation of pseudo-labels within both self-supervised and unsupervised methods. By drawing the connection between these areas we identify new directions when advancements in one area would likely benefit others, such as curriculum learning and self-supervised regularization.


CRMSP: A Semi-supervised Approach for Key Information Extraction with Class-Rebalancing and Merged Semantic Pseudo-Labeling

Zhang, Qi, Song, Yonghong, Guo, Pengcheng, Hui, Yangyang

arXiv.org Artificial Intelligence

There is a growing demand in the field of KIE (Key Information Extraction) to apply semi-supervised learning to save manpower and costs, as training document data using fully-supervised methods requires labor-intensive manual annotation. The main challenges of applying SSL in the KIE are (1) underestimation of the confidence of tail classes in the long-tailed distribution and (2) difficulty in achieving intra-class compactness and inter-class separability of tail features. To address these challenges, we propose a novel semi-supervised approach for KIE with Class-Rebalancing and Merged Semantic Pseudo-Labeling (CRMSP). Firstly, the Class-Rebalancing Pseudo-Labeling (CRP) module introduces a reweighting factor to rebalance pseudo-labels, increasing attention to tail classes. Secondly, we propose the Merged Semantic Pseudo-Labeling (MSP) module to cluster tail features of unlabeled data by assigning samples to Merged Prototypes (MP). Additionally, we designed a new contrastive loss specifically for MSP. Extensive experimental results on three well-known benchmarks demonstrate that CRMSP achieves state-of-the-art performance. Remarkably, CRMSP achieves 3.24% f1-score improvement over state-of-the-art on the CORD.


Unsupervised ASR via Cross-Lingual Pseudo-Labeling

Likhomanenko, Tatiana, Lugosch, Loren, Collobert, Ronan

arXiv.org Artificial Intelligence

Recent work has shown that it is possible to train an unsupervised automatic speech recognition (ASR) system using only unpaired audio and text. Existing unsupervised ASR methods assume that no labeled data can be used for training. We argue that even if one does not have any labeled audio for a given language, there is always labeled data available for other languages. We show that it is possible to use character-level acoustic models (AMs) from other languages to bootstrap an unsupervised AM in a new language. Here, "unsupervised" means no labeled audio is available for the target language. Our approach is based on two key ingredients: (i) generating pseudo-labels (PLs) of the target language using some other language AM and (ii) constraining these PLs with a target language model. Our approach is effective on Common Voice: e.g. It also outperforms character-based wav2vec-U 2.0 by 15% absolute WER on LJSpeech with 800h of labeled German data instead of 60k hours of unlabeled English data. Spanish, es) and generating pseudo-labels using a language model for the desired target language (e.g. English), we can train an unsupervised speech recognition system for the target language using iterative pseudo-labeling.


Confident Sinkhorn Allocation for Pseudo-Labeling

Nguyen, Vu, Husain, Hisham, Farfade, Sachin, Hengel, Anton van den

arXiv.org Artificial Intelligence

Semi-supervised learning is a critical tool in reducing machine learning's dependence on labeled data. It has been successfully applied to structured data, such as images and natural language, by exploiting the inherent spatial and semantic structure therein with pretrained models or data augmentation. These methods are not applicable, however, when the data does not have the appropriate structure, or invariances. Due to their simplicity, pseudo-labeling (PL) methods can be widely used without any domain assumptions. However, PL is sensitive to a threshold and can perform poorly if wrong assignments are made due to overconfidence. This paper studies theoretically the role of uncertainty to pseudo-labeling and proposes Confident Sinkhorn Allocation (CSA), which identifies the best pseudo-label allocation via optimal transport to only samples with high confidence scores. CSA outperforms the current state-of-the-art in this practically important area of semi-supervised learning. Additionally, we propose to use the Integral Probability Metrics to extend and improve the existing PAC-Bayes bound which relies on the Kullback-Leibler (KL) divergence, for ensemble models. Our code is publicly available at https://github.com/amzn/confident-sinkhorn-allocation.


Embarrassingly Simple MixUp for Time-series

Aggarwal, Karan, Srivastava, Jaideep

arXiv.org Artificial Intelligence

Labeling time series data is an expensive task because of domain expertise and dynamic nature of the data. Hence, we often have to deal with limited labeled data settings. Data augmentation techniques have been successfully deployed in domains like computer vision to exploit the use of existing labeled data. We adapt one of the most commonly used technique called MixUp, in the time series domain. Our proposed, MixUp++ and LatentMixUp++, use simple modifications to perform interpolation in raw time series and classification model's latent space, respectively. We also extend these methods with semi-supervised learning to exploit unlabeled data. We observe significant improvements of 1\% - 15\% on time series classification on two public datasets, for both low labeled data as well as high labeled data regimes, with LatentMixUp++.


On Pseudo-Labeling for Class-Mismatch Semi-Supervised Learning

Han, Lu, Ye, Han-Jia, Zhan, De-Chuan

arXiv.org Artificial Intelligence

When there are unlabeled Out-Of-Distribution (OOD) data from other classes, Semi-Supervised Learning (SSL) methods suffer from severe performance degradation and even get worse than merely training on labeled data. In this paper, we empirically analyze Pseudo-Labeling (PL) in class-mismatched SSL. PL is a simple and representative SSL method that transforms SSL problems into supervised learning by creating pseudo-labels for unlabeled data according to the model's prediction. We aim to answer two main questions: (1) How do OOD data influence PL? (2) What is the proper usage of OOD data with PL? First, we show that the major problem of PL is imbalanced pseudo-labels on OOD data. Second, we find that OOD data can help classify In-Distribution (ID) data given their OOD ground truth labels. Based on the findings, we propose to improve PL in class-mismatched SSL with two components -- Re-balanced Pseudo-Labeling (RPL) and Semantic Exploration Clustering (SEC). RPL re-balances pseudo-labels of high-confidence data, which simultaneously filters out OOD data and addresses the imbalance problem. SEC uses balanced clustering on low-confidence data to create pseudo-labels on extra classes, simulating the process of training with ground truth. Experiments show that our method achieves steady improvement over supervised baseline and state-of-the-art performance under all class mismatch ratios on different benchmarks.


Pseudo-labeling a simple semi-supervised learning method - Data, what now?

@machinelearnbot

The foundation of every machine learning project is data – the one thing you cannot do without. In this post, I will show how a simple semi-supervised learning method called pseudo-labeling that can increase the performance of your favorite machine learning models by utilizing unlabeled data. To train a machine learning model with supervised learning, the data has to be labeled. Does that mean that unlabeled data is useless for supervised tasks like classification and regression? Aside from using the extra data for analytic purposes, we can even use it to help train our model with semi-supervised learning – combining both unlabeled and labeled data for model training.