DP-SSL: Towards Robust Semi-supervised Learning with A Few Labeled Samples

Neural Information Processing Systems

The scarcity of labeled data is a critical obstacle to deep learning. Semi-supervised learning (SSL) provides a promising way to leverage unlabeled data by pseudo labels. However, when the size of labeled data is very small (say a few labeled samples per class), SSL performs poorly and unstably, possibly due to the low quality of learned pseudo labels. In this paper, we propose a new SSL method called DP-SSL that adopts an innovative data programming (DP) scheme to generate probabilistic labels for unlabeled data. Different from existing DP methods that rely on human experts to provide initial labeling functions (LFs), we develop a multiple-choice learning (MCL) based approach to automatically generate LFs from scratch in SSL style. With the noisy labels produced by the LFs, we design a label model to resolve the conflict and overlap among the noisy labels, and finally infer probabilistic labels for unlabeled samples. Extensive experiments on four standard SSL benchmarks show that DP-SSL can provide reliable labels for unlabeled data and achieve better classification performance on test sets than existing SSL methods, especially when only a small number of labeled samples are available. Concretely, for CIFAR-10 with only 40 labeled samples, DP-SSL achieves 93.82% annotation accuracy on unlabeled data and 93.46% classification accuracy on test data, which are higher than the SOTA results.
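The core data-programming step described in this abstract, combining noisy labeling-function (LF) votes into probabilistic labels, can be sketched generically. The accuracy-weighted vote below is a simplified stand-in for illustration, not DP-SSL's actual label model; the function name, the abstain code `-1`, and the assumption that LF accuracies are known are all illustrative.

```python
import numpy as np

def probabilistic_labels(lf_votes, lf_accuracies, n_classes):
    """Combine noisy labeling-function votes into probabilistic labels.

    lf_votes: (n_samples, n_lfs) int array; -1 means the LF abstains.
    lf_accuracies: (n_lfs,) estimated accuracy of each LF, in (0, 1).
    Returns an (n_samples, n_classes) row-stochastic matrix.
    """
    n_samples, n_lfs = lf_votes.shape
    # Log-odds weighting: more accurate LFs cast stronger votes.
    weights = np.log(lf_accuracies / (1.0 - lf_accuracies))
    scores = np.zeros((n_samples, n_classes))
    for j in range(n_lfs):
        for c in range(n_classes):
            scores[:, c] += weights[j] * (lf_votes[:, j] == c)
    # Softmax over classes turns vote scores into probabilistic labels.
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)
```

Abstaining LFs contribute nothing to any class, so samples covered only by weak or conflicting LFs end up with softer (higher-entropy) label distributions, which is exactly what a downstream classifier can exploit.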



Clinical Uncertainty Impacts Machine Learning Evaluations

Lionetti, Simone, Gröger, Fabian, Gottfrois, Philippe, Gonzalez-Jimenez, Alvaro, Amruthalingam, Ludovic, Navarini, Alexander A., Pouly, Marc

arXiv.org Artificial Intelligence

Clinical dataset labels are rarely certain as annotators disagree and confidence is not uniform across cases. Typical aggregation procedures, such as majority voting, obscure this variability. In simple experiments on medical imaging benchmarks, accounting for the confidence in binary labels significantly impacts model rankings. We therefore argue that machine-learning evaluations should explicitly account for annotation uncertainty using probabilistic metrics that directly operate on distributions. These metrics can be applied independently of the annotations' generating process, whether modeled by simple counting, subjective confidence ratings, or probabilistic response models. They are also computationally lightweight, as closed-form expressions have linear-time implementations once examples are sorted by model score. We thus urge the community to release raw annotations for datasets and to adopt uncertainty-aware evaluation so that performance estimates may better reflect clinical data.
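One minimal instance of the kind of uncertainty-aware, linear-time metric this abstract argues for is expected accuracy under per-case label confidence. This is a hypothetical illustration, not the paper's exact metric: `p_positive` here stands for any annotator-derived probability that a case is positive (e.g. the fraction of raters voting positive).

```python
import numpy as np

def expected_accuracy(pred, p_positive):
    """Expected accuracy of hard binary predictions under label uncertainty.

    pred: (n,) array of 0/1 model predictions.
    p_positive: (n,) probability that each case's true label is 1.
    A prediction of 1 is correct with probability p, a prediction of 0
    with probability 1 - p; averaging over these keeps uncertain cases
    from counting the same as unanimous ones.
    """
    pred = np.asarray(pred)
    p = np.asarray(p_positive, dtype=float)
    return float(np.mean(np.where(pred == 1, p, 1.0 - p)))
```

Under majority-voted hard labels, a case with 51% rater agreement scores identically to one with unanimous agreement; the expectation above weights each case by how certain its label actually is, which is why such metrics can reorder model rankings.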




Judging with Confidence: Calibrating Autoraters to Preference Distributions

Li, Zhuohang, Li, Xiaowei, Huang, Chengyu, Li, Guowang, Goshvadi, Katayoon, Dai, Bo, Schuurmans, Dale, Zhou, Paul, Palangi, Hamid, Song, Yiwen, Goyal, Palash, Kantarcioglu, Murat, Malin, Bradley A., Xue, Yuan

arXiv.org Artificial Intelligence

The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: 1) a direct supervised fine-tuning approach for dense, probabilistic labels, and 2) a reinforcement learning approach for sparse, binary labels. Our empirical results show that fine-tuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.
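The distribution-matching idea can be illustrated with a cross-entropy objective against the target preference distribution. This is a generic sketch under stated assumptions, not the paper's training objective: `preference_matching_loss` is an assumed name, and the distributions are taken as plain probability lists over the answer options.

```python
import math

def preference_matching_loss(q_pred, p_target, eps=1e-12):
    """Cross-entropy between a target preference distribution and the
    autorater's predicted distribution over the same options.

    p_target: population preference probabilities, e.g. [0.7, 0.3]
              meaning 70% of raters prefer response A.
    q_pred:   the autorater's predicted probabilities for those options.
    Minimizing this matches the predicted distribution to the
    population's, instead of forcing a single discrete "ground truth".
    """
    return -sum(p * math.log(q + eps) for p, q in zip(p_target, q_pred))
```

The loss is minimized when the predicted distribution equals the target, so an autorater that confidently picks one side of a genuinely split 70/30 preference is penalized more than one that verbalizes the split, which is the calibration behavior the abstract reports.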