Unsupervised or Indirectly Supervised Learning


Learning with not Enough Data Part 1: Semi-Supervised Learning

#artificialintelligence

The performance of supervised learning tasks improves with more high-quality labels available. However, it is expensive to collect a large number of labeled ...


Extending the WILDS Benchmark for Unsupervised Adaptation

arXiv.org Artificial Intelligence

Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data. However, existing distribution shift benchmarks for unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. To maintain consistency, the labeled training, validation, and test sets, as well as the evaluation metrics, are exactly the same as in the original WILDS benchmark. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). We systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS 2.0 is limited. To facilitate method development and evaluation, we provide an open-source package that automates data loading and contains all of the model architectures and methods used in this paper. Code and leaderboards are available at https://wilds.stanford.edu.


Understanding your performance metrics for clustering

#artificialintelligence

Clustering falls under unsupervised learning, a niche part of machine learning. Supervised learning, which is more common in machine learning study, covers tasks such as classification that learn from labeled data and make class predictions. This does not make clustering less desirable, as clustering algorithms are essential for discovering unexplored insights. It is therefore important to evaluate the performance of a clustering task and to decide whether the clusters formed are trustworthy. Silhouette analysis is the most common method because it is more straightforward than the others.
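To make silhouette analysis concrete, here is a minimal sketch of the standard silhouette coefficient, s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other points in its own cluster and b is the mean distance to the nearest other cluster. The function name, the Euclidean distance, and the score of 0 for singleton clusters are assumptions made for illustration; the article's own implementation is not shown here.

```python
import math


def silhouette_scores(points, labels):
    """Return one silhouette coefficient per point (hypothetical helper)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Assumed convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores
```

Scores near 1 suggest compact, well-separated clusters, while negative scores suggest that a point sits closer to a neighboring cluster than to its own.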


A Note on Machine Learning

#artificialintelligence

Unsupervised learning uses algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns and data groupings without the need for human intervention. It is generally used for exploratory data analysis, customer segmentation, recommender systems, big data visualization, feature elicitation, etc. Roughly, there are three types of unsupervised learning approaches. Clustering is a data mining technique that groups unlabeled data based on similarities and differences. Clustering algorithms are used to process raw, unclassified data objects into groups represented by structures and patterns in the information.
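As one illustration of grouping unlabeled data by similarity, the sketch below runs a few iterations of k-means. The excerpt does not name a specific algorithm, so k-means, the function name, and the caller-supplied starting centroids are all assumptions chosen to keep the example short and deterministic.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means with fixed starting centroids (illustrative sketch)."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # leave an empty cluster's centroid where it is
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids
```

No labels are supplied at any point: the groups emerge only from distances between the points.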


Improving Transferability of Representations via Augmentation-Aware Self-Supervision

arXiv.org Artificial Intelligence

Recent unsupervised representation learning methods have been shown to be effective in a range of vision tasks by learning representations invariant to data augmentations such as random cropping and color jittering. However, such invariance could be harmful to downstream tasks if they rely on the characteristics of the data augmentations, e.g., if they are location- or color-sensitive. This is not an issue just for unsupervised learning; we found that this occurs even in supervised learning because it also learns to predict the same label for all augmented samples of an instance. To avoid such failures and obtain more generalizable representations, we suggest optimizing an auxiliary self-supervised loss, coined AugSelf, that learns the difference of augmentation parameters (e.g., cropping positions, color adjustment intensities) between two randomly augmented samples. Our intuition is that AugSelf encourages the preservation of augmentation-aware information in learned representations, which could be beneficial for their transferability. Furthermore, AugSelf can easily be incorporated into recent state-of-the-art representation learning methods with a negligible additional training cost. Extensive experiments demonstrate that our simple idea consistently improves the transferability of representations learned by supervised and unsupervised methods in various transfer learning scenarios. The code is available at https://github.com/hankook/AugSelf.
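The idea of an auxiliary loss on the difference of augmentation parameters can be sketched in a few lines. This is not the authors' implementation (see the repository linked above for that): the crop encoding as normalized (x, y, width, height), the mean squared error, the weighting factor, and all function names are assumptions made for illustration.

```python
def crop_difference(crop_a, crop_b):
    """Regression target: element-wise difference of two crop parameter tuples."""
    return tuple(a - b for a, b in zip(crop_a, crop_b))


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def total_loss(base_loss, predicted_diff, crop_a, crop_b, weight=1.0):
    """Base representation loss plus a weighted augmentation-aware term.

    predicted_diff stands in for the output of a small prediction head
    that sees the features of both augmented views (assumed, not shown).
    """
    target = crop_difference(crop_a, crop_b)
    return base_loss + weight * mse(predicted_diff, target)
```

Because the auxiliary term can only be driven down by predicting how the two views were cropped, the features have to keep some positional information instead of discarding it.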


"Generative Adversarial Networks" Science-Research, November 2021, Week 3 -- summary from Arxiv…

#artificialintelligence

Low-dose CT (LDCT) has drawn major interest in the medical imaging field because of the potential health risks of CT-associated X-ray radiation to patients. The benefit of such a U-Net based discriminator is that it can not only provide per-pixel feedback to the denoising network through the outputs of the U-Net but also focus on the global structure at a semantic level through the middle layer of the U-Net. Generative Adversarial Networks have long since changed the world of computer vision and, tied to it, the world of art. In this work, we propose using the latter and show a way to use the features it has learned from the training dataset to both alter an image and generate one from scratch. This paper presents a novel multi-fake evolutionary generative adversarial network for handling imbalanced hyperspectral image classification.


Hybrid BYOL-ViT: Efficient approach to deal with small datasets

arXiv.org Artificial Intelligence

Supervised learning can learn large representational spaces, which are crucial for handling difficult learning tasks. However, due to the design of the model, classical image classification approaches struggle to generalize to new problems and new situations when dealing with small datasets. In fact, supervised learning can lose the location of image features, which leads to supervision collapse in very deep architectures. In this paper, we investigate how self-supervision with strong and sufficient augmentation of unlabeled data can effectively train the first layers of a neural network even better than supervised learning, with no need for millions of labeled examples. The main goal is to disconnect pixel data from annotation by getting generic task-agnostic low-level features. Furthermore, we look into Vision Transformers (ViT) and show that the low-level features derived from a self-supervised architecture can improve the robustness and the overall performance of this emergent architecture. We evaluated our method on one of the smallest open-source datasets, STL-10, and obtained a significant boost in performance from 41.66% to 83.25% when inputting low-level features from a self-supervised learning architecture to the ViT instead of the raw images.


Query-augmented Active Metric Learning

arXiv.org Machine Learning

In this paper we propose an active metric learning method for clustering with pairwise constraints. The proposed method actively queries the label of informative instance pairs, while estimating underlying metrics by incorporating unlabeled instance pairs, which leads to a more accurate and efficient clustering process. In particular, we augment the queried constraints by generating more pairwise labels to provide additional information in learning a metric to enhance clustering performance. Furthermore, we increase the robustness of metric learning by updating the learned metric sequentially and penalizing the irrelevant features adaptively. In addition, we propose a novel active query strategy that evaluates the information gain of instance pairs more accurately by incorporating the neighborhood structure, which improves clustering efficiency without extra labeling cost. In theory, we provide a tighter error bound of the proposed metric learning method utilizing augmented queries compared with methods using existing constraints only. Furthermore, we also investigate the improvement using the active query strategy instead of random selection. Numerical studies on simulation settings and real datasets indicate that the proposed method is especially advantageous when the signal-to-noise ratio between significant features and irrelevant features is low.


Learning Machine Learning

#artificialintelligence

Machine Learning is a branch of Artificial Intelligence (AI) that is used to predict outcomes of an application without being explicitly programmed to do so. Supervised Learning: It is a type of Machine Learning where the machine is trained with well-labeled data. Thus the model is able to predict the price on this well-labeled dataset. Unsupervised Learning: It is a type of Machine Learning where the machine is trained to identify patterns and predict outcomes with unlabeled data. Example: If a machine is given a dataset containing the pictures of dolphins and whales (considering the machine has never seen any pictures of dolphins and whales).


Assessing Effectiveness of Using Internal Signals for Check-Worthy Claim Identification in Unlabeled Data for Automated Fact-Checking

arXiv.org Artificial Intelligence

While recent work on automated fact-checking has focused mainly on verifying and explaining claims, for which the list of claims is readily available, identifying check-worthy claim sentences from a text remains challenging. Current claim identification models rely on manual annotations for each sentence in the text, which is an expensive task and challenging to conduct on a frequent basis across multiple domains. This paper explores methodology to identify check-worthy claim sentences from fake news articles, irrespective of domain, without explicit sentence-level annotations. We leverage two internal supervisory signals - headline and the abstractive summary - to rank the sentences based on semantic similarity. We hypothesize that this ranking directly correlates to the check-worthiness of the sentences. To assess the effectiveness of this hypothesis, we build pipelines that leverage the ranking of sentences based on either the headline or the abstractive summary. The top-ranked sentences are used for the downstream fact-checking tasks of evidence retrieval and the article's veracity prediction by the pipeline. Our findings suggest that the top 3 ranked sentences contain enough information for evidence-based fact-checking of a fake news article. We also show that while the headline has more gisting similarity with how a fact-checking website writes a claim, the summary-based pipeline is the most promising for an end-to-end fact-checking system.
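The ranking step described above can be sketched as follows. The abstract does not say how semantic similarity is computed, so this example swaps in a plain bag-of-words cosine similarity as a stand-in; the tokenizer, the function names, and the default of three sentences (taken from the top-3 finding) are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (a crude stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two token-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def top_k_sentences(signal, sentences, k=3):
    """Rank article sentences by similarity to a headline or summary, keep the top k."""
    sig = bow(signal)
    ranked = sorted(sentences, key=lambda s: cosine(sig, bow(s)), reverse=True)
    return ranked[:k]
```

A real pipeline would replace `bow` and `cosine` with a semantic similarity model; only the rank-then-truncate structure is meant to carry over.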