Goto

Collaborating Authors

 leveraging unlabeled data


Anti-causal domain generalization: Leveraging unlabeled data

arXiv.org Machine Learning

The problem of domain generalization concerns learning predictive models that are robust to distribution shifts when deployed in new, previously unseen environments. Existing methods typically require labeled data from multiple training environments, limiting their applicability when labeled data are scarce. In this work, we study domain generalization in an anti-causal setting, where the outcome causes the observed covariates. Under this structure, environment perturbations that affect the covariates do not propagate to the outcome, which motivates regularizing the model's sensitivity to these perturbations. Crucially, estimating these perturbation directions does not require labels, enabling us to leverage unlabeled data from multiple environments. We propose two methods that penalize the model's sensitivity to variations in the mean and covariance of the covariates across environments, respectively, and prove that these methods have worst-case optimality guarantees under certain classes of environments. Finally, we demonstrate the empirical performance of our approach on a controlled physical system and a physiological signal dataset.


Leveraging Unlabeled Data: A Guide to Semi-Supervised Learning

#artificialintelligence

Semi-supervised learning (SSL) is a machine learning technique that aims to improve the accuracy and efficiency of models by leveraging both labeled and unlabeled data. In this technique, a model is trained using a small amount of labeled data, which is then used to make predictions on a much larger set of unlabeled data. The model then learns from these predictions and adjusts its parameters to improve its accuracy. In traditional supervised learning, a model is trained on a dataset that has both input features and corresponding output labels. The model then uses this labeled data to learn patterns and make predictions on new, unseen data.


Leveraging Unlabeled Data

Communications of the ACM

Despite the rapid advances it has made it over the past decade, deep learning presents many industrial users with problems when they try to implement the technology, issues that the Internet giants have worked around through brute force. "The challenge that today's systems face is the amount of data they need for training," says Tim Ensor, head of artificial intelligence (AI) at U.K.-based technology company Cambridge Consultants. "On top of that, it needs to be structured data." Most of the commercial applications and algorithm benchmarks used to test deep neural networks (DNNs) consume copious quantities of labeled data; for example, images or pieces of text that have already been tagged in some way by a human to indicate what the sample represents. The Internet giants, who have collected the most data for use in training deep learning systems, have often resorted to crowdsourcing measures such as asking people to prove they are human during logins by identifying objects in a collection of images, or simply buying manual labor through services such as Amazon's Mechanical Turk.


Leveraging Unlabeled Data to Scale Blocking for Record Linkage

AAAI Conferences

Record linkage is the process of matching records between two (or multiple) data sets that represent the same real-world entity. An exhaustive record linkage process involves computing the similarities between all pairs of records, which can be very expensive for large data sets. Blocking techniques alleviate this problem by dividing the records into blocks and only comparing records within the same block. To be adaptive from domain to domain, one category of blocking technique formalizes 'construction of blocking scheme' as a machine learning problem. In the process of learning the best blocking scheme, previous learning-based techniques utilize only a set of labeled data. However, since the set of labeled data is usually not large enough to well characterize the unseen (unlabeled) data, the resultant blocking scheme may poorly perform on the unseen data by generating too many candidate matches. To address that, in this paper, we propose to utilize unlabeled data (in addition to labeled data) for learning blocking schemes. Our experimental results show that using unlabeled data in learning can remarkably reduce the number of candidate matches while keeping the same level of coverage for true matches.