Goto

Collaborating Authors

 Unsupervised or Indirectly Supervised Learning


Sample-Optimal Agnostic Boosting with Unlabeled Data

arXiv.org Machine Learning

Boosting provides a practical and provably effective framework for constructing accurate learning algorithms from inaccurate rules of thumb. It extends the promise of sample-efficient learning to settings where direct Empirical Risk Minimization (ERM) may not be implementable efficiently. In the realizable setting, boosting is known to offer this computational reprieve without compromising on sample efficiency. However, in the agnostic case, existing boosting algorithms fall short of achieving the optimal sample complexity. This paper highlights an unexpected and previously unexplored avenue of improvement: unlabeled samples. We design a computationally efficient agnostic boosting algorithm that matches the sample complexity of ERM, given polynomially many additional unlabeled samples. In fact, we show that the total number of samples needed, unlabeled and labeled inclusive, is never more than that for the best known agnostic boosting algorithm -- so this result is never worse -- while only a vanishing fraction of these need to be labeled for the algorithm to succeed. This is particularly fortuitous for learning-theoretic applications of agnostic boosting, which often take place in the distribution-specific setting, where unlabeled samples can be availed for free. We detail other applications of this result in reinforcement learning.


Machine Learning in Biomechanics: Key Applications and Limitations in Walking, Running, and Sports Movements

arXiv.org Artificial Intelligence

This chapter provides an overview of recent and promising Machine Learning applications, i.e. pose estimation, feature estimation, event detection, data exploration & clustering, and automated classification, in gait (walking and running) and sports biomechanics. It explores the potential of Machine Learning methods to address challenges in biomechanical workflows, highlights central limitations, i.e. data and annotation availability and explainability, that need to be addressed, and emphasises the importance of interdisciplinary approaches for fully harnessing the potential of Machine Learning in gait and sports biomechanics.


Confidence HNC: A Network Flow Technique for Binary Classification with Noisy Labels

arXiv.org Artificial Intelligence

The performance of machine learning models depends to a great extent on the data quality and, in particular, the reliability of the labels. Label noise is one of the concerning issues that has a tremendous impact on the outcome of learning methods and receives attention from researchers in the community. Among different classes of learning methods, semi-supervised learning is a class of methods that utilize information from unlabeled data in addition to the labeled data, and they are often used in the context where labeled data is scarce or costly [Zhu and Goldberg, 2009]. By counterbalancing the effect of possibly noisy labeled data with information from unlabeled data, these methods also have the potential of mitigating the issue of label noise, on top of its advantage in the scenario where labeled samples are given in a limited amount. A particular class of semi-supervised methods that we are interested in is the class of network-flow based, or graph based, methods in which minimum cut solution of a graph representation of the data provides label prediction of unlabeled samples. Unlabeled samples assist the method through their connectivity with labeled samples, as well as that among themselves.


Re-Evaluating the Impact of Unseen-Class Unlabeled Data on Semi-Supervised Learning Model

arXiv.org Artificial Intelligence

Semi-supervised learning (SSL) effectively leverages unlabeled data and has been proven successful across various fields. Current safe SSL methods believe that unseen classes in unlabeled data harm the performance of SSL models. However, previous methods for assessing the impact of unseen classes on SSL model performance are flawed. They fix the size of the unlabeled dataset and adjust the proportion of unseen classes within the unlabeled data to assess the impact. This process contravenes the principle of controlling variables. Adjusting the proportion of unseen classes in unlabeled data alters the proportion of seen classes, meaning the decreased classification performance of seen classes may not be due to an increase in unseen class samples in the unlabeled data, but rather a decrease in seen class samples. Thus, the prior flawed assessment standard that ``unseen classes in unlabeled data can damage SSL model performance" may not always hold true. This paper strictly adheres to the principle of controlling variables, maintaining the proportion of seen classes in unlabeled data while only changing the unseen classes across five critical dimensions, to investigate their impact on SSL models from global robustness and local robustness. Experiments demonstrate that unseen classes in unlabeled data do not necessarily impair the performance of SSL models; in fact, under certain conditions, unseen classes may even enhance them.


A Unified Framework for Heterogeneous Semi-supervised Learning

arXiv.org Artificial Intelligence

In this work, we introduce a novel problem setup termed as Heterogeneous Semi-Supervised Learning (HSSL), which presents unique challenges by bridging the semi-supervised learning (SSL) task and the unsupervised domain adaptation (UDA) task, and expanding standard semi-supervised learning to cope with heterogeneous training data. At its core, HSSL aims to learn a prediction model using a combination of labeled and unlabeled training data drawn separately from heterogeneous domains that share a common set of semantic categories; this model is intended to differentiate the semantic categories of test instances sampled from both the labeled and unlabeled domains. In particular, the labeled and unlabeled domains have dissimilar label distributions and class feature distributions. This heterogeneity, coupled with the assorted sources of the test data, introduces significant challenges to standard SSL and UDA methods. Therefore, we propose a novel method, Unified Framework for Heterogeneous Semi-supervised Learning (Uni-HSSL), to address HSSL by directly learning a fine-grained classifier from the heterogeneous data, which adaptively handles the inter-domain heterogeneity while leveraging both the unlabeled data and the inter-domain semantic class relationships for cross-domain knowledge transfer and adaptation. We conduct comprehensive experiments and the experimental results validate the efficacy and superior performance of the proposed Uni-HSSL over state-of-the-art semi-supervised learning and unsupervised domain adaptation methods.


CuPID: Leveraging Masked Single-Lead ECG Modelling for Enhancing the Representations

arXiv.org Artificial Intelligence

Wearable sensing devices, such as Electrocardiogram (ECG) heart-rate monitors, will play a crucial role in the future of digital health. This continuous monitoring leads to massive unlabeled data, incentivizing the development of unsupervised learning frameworks. While Masked Data Modelling (MDM) techniques have enjoyed wide use, their direct application to single-lead ECG data is suboptimal due to the decoder's difficulty handling irregular heartbeat intervals when no contextual information is provided. In this paper, we present Cueing the Predictor Increments the Detailing (CuPID), a novel MDM method tailored to single-lead ECGs. CuPID enhances existing MDM techniques by cueing spectrogram-derived context to the decoder, thus incentivizing the encoder to produce more detailed representations. This has a significant impact on the encoder's performance across a wide range of different configurations, leading CuPID to outperform state-of-the-art methods in a variety of downstream tasks.


TestNUC: Enhancing Test-Time Computing Approaches through Neighboring Unlabeled Data Consistency

arXiv.org Artificial Intelligence

Test-time computing approaches, which leverage additional computational resources during inference, have been proven effective in enhancing large language model performance. This work introduces a novel, linearly scaling approach, TestNUC, that improves test-time predictions by leveraging the local consistency of neighboring unlabeled data-it classifies an input instance by considering not only the model's prediction on that instance but also on neighboring unlabeled instances. We evaluate TestNUC across eight diverse datasets, spanning intent classification, topic mining, domain discovery, and emotion detection, demonstrating its consistent superiority over baseline methods such as standard prompting and self-consistency. Furthermore, TestNUC can be seamlessly integrated with existing test-time computing approaches, substantially boosting their performance. Our analysis reveals that TestNUC scales effectively with increasing amounts of unlabeled data and performs robustly across different embedding models, making it practical for real-world applications. Our code is available at https://github.com/HenryPengZou/TestNUC.


Rebalancing the Scales: A Systematic Mapping Study of Generative Adversarial Networks (GANs) in Addressing Data Imbalance

arXiv.org Artificial Intelligence

Machine learning algorithms are used in diverse domains, many of which face significant challenges due to data imbalance. Studies have explored various approaches to address the issue, like data preprocessing, cost-sensitive learning, and ensemble methods. Generative Adversarial Networks (GANs) showed immense potential as a data preprocessing technique that generates good quality synthetic data. This study employs a systematic mapping methodology to analyze 3041 papers on GAN-based sampling techniques for imbalanced data sourced from four digital libraries. A filtering process identified 100 key studies spanning domains such as healthcare, finance, and cybersecurity. Through comprehensive quantitative analysis, this research introduces three categorization mappings as application domains, GAN techniques, and GAN variants used to handle the imbalanced nature of the data. GAN-based over-sampling emerges as an effective preprocessing method. Advanced architectures and tailored frameworks helped GANs to improve further in the case of data imbalance. GAN variants like vanilla GAN, CTGAN, and CGAN show great adaptability in structured imbalanced data cases. Interest in GANs for imbalanced data has grown tremendously, touching a peak in recent years, with journals and conferences playing crucial roles in transmitting foundational theories and practical applications. While with these advances, none of the reviewed studies explicitly explore hybridized GAN frameworks with diffusion models or reinforcement learning techniques. This gap leads to a future research idea develop innovative approaches for effectively handling data imbalance.


Context-Aware Doubly-Robust Semi-Supervised Learning

arXiv.org Artificial Intelligence

--The widespread adoption of artificial intelligence (AI) in next-generation communication systems is challenged by the heterogeneity of traffic and network conditions, which call for the use of highly contextual, site-specific, data. A promising solution is to rely not only on real-world data, but also on synthetic pseudo-data generated by a network digital twin (NDT). However, the effectiveness of this approach hinges on the accuracy of the NDT, which can vary widely across different contexts. T o address this problem, this paper introduces context-aware doubly-robust (CDR) learning, a novel semi-supervised scheme that adapts its reliance on the pseudo-data to the different levels of fidelity of the NDT across contexts. CDR is evaluated on the task of downlink beamforming, showing superior performance compared to previous state-of-the-art semi-supervised approaches.


Confidence-Weighted Boundary-Aware Learning for Semi-Supervised Semantic Segmentation

arXiv.org Artificial Intelligence

Semi-supervised semantic segmentation (SSSS) aims to improve segmentation performance by utilising unlabeled data alongside limited labeled samples. Existing SSSS methods often face challenges such as coupling, where over-reliance on initial labeled data leads to suboptimal learning; confirmation bias, where incorrect predictions reinforce themselves repeatedly; and boundary blur caused by insufficient boundary-awareness and ambiguous edge information. To address these issues, we propose CW-BASS, a novel framework for SSSS. In order to mitigate the impact of incorrect predictions, we assign confidence weights to pseudo-labels. Additionally, we leverage boundary-delineation techniques, which, despite being extensively explored in weakly-supervised semantic segmentation (WSSS) remain under-explored in SSSS. Specifically, our approach: (1) reduces coupling through a confidence-weighted loss function that adjusts the influence of pseudo-labels based on their predicted confidence scores, (2) mitigates confirmation bias with a dynamic thresholding mechanism that learns to filter out pseudo-labels based on model performance, (3) resolves boundary blur with a boundary-aware module that enhances segmentation accuracy near object boundaries, and (4) reduces label noise with a confidence decay strategy that progressively refines pseudo-labels during training. Extensive experiments on the Pascal VOC 2012 and Cityscapes demonstrate that our method achieves state-of-the-art performance. Moreover, using only 1/8 or 12.5\% of labeled data, our method achieves a mIoU of 75.81 on Pascal VOC 2012, highlighting its effectiveness in limited-label settings.