Unsupervised or Indirectly Supervised Learning
Negative sampling in semi-supervised learning
Chen, John, Shah, Vatsal, Kyrillidis, Anastasios
We introduce Negative Sampling in Semi-Supervised Learning (NS3L), a simple, fast, easy-to-tune algorithm for semi-supervised learning (SSL). NS3L is motivated by the success of negative sampling/contrastive estimation. We demonstrate that adding the NS3L loss to state-of-the-art SSL algorithms such as Virtual Adversarial Training (VAT) significantly improves upon vanilla VAT and its variant, VAT with Entropy Minimization. By adding the NS3L loss to MixMatch, the current state-of-the-art approach on semi-supervised tasks, we observe significant improvements over vanilla MixMatch. We conduct extensive experiments on the CIFAR10, CIFAR100, SVHN and STL10 benchmark datasets.
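The abstract does not state the NS3L loss itself, so the following is only a minimal sketch of the general negative-sampling idea under an explicit assumption: for an unlabeled example, classes whose predicted probability falls below a threshold are treated as negatives and penalized with -log(1 - p). The function name `negative_sampling_loss` and the `threshold` value are hypothetical and not taken from the paper.

```python
import math

def negative_sampling_loss(probs, threshold=0.05):
    """Penalize probability mass on classes assumed NOT to be the true label.

    probs: predicted class probabilities for one unlabeled example.
    threshold: hypothetical cutoff; classes below it are treated as negatives.
    """
    negatives = [p for p in probs if p < threshold]
    # -log(1 - p) is close to zero when p is close to zero and grows with p.
    return -sum(math.log(1.0 - p) for p in negatives)

# Classes with probabilities 0.03 and 0.01 fall below the cutoff and are penalized.
loss = negative_sampling_loss([0.90, 0.06, 0.03, 0.01])
```

In a training loop, a term of this kind would be added to an existing SSL objective (for example VAT or MixMatch) with a weighting coefficient.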
Learning from a Teacher using Unlabeled Data
Menghani, Gaurav, Ravi, Sujith
Knowledge distillation is a widely used technique for model compression. We posit that the teacher model used in a distillation setup captures relationships between classes that extend beyond the original dataset. We empirically show that a teacher model can transfer this knowledge to a student model even on an out-of-distribution dataset. Using this approach, we show promising results on the MNIST, CIFAR-10, and Caltech-256 datasets using unlabeled image data from different sources. Our results are encouraging and help shed further light on knowledge distillation and on the use of unlabeled data to improve model quality.
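To make the role of unlabeled data concrete, here is a minimal sketch of a standard distillation loss in which the teacher's softened output is the only training target. It is a generic illustration rather than the authors' exact setup; the temperature value and function names are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft labels and the student's prediction."""
    soft_targets = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_targets, student_probs))

# No ground-truth label is needed: the teacher's output is the only target.
loss_match = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
loss_mismatch = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

The loss is smallest when the student reproduces the teacher's distribution, so `loss_match` is lower than `loss_mismatch`.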
Semi-Supervised Method using Gaussian Random Fields for Boilerplate Removal in Web Browsers
Boilerplate removal refers to the problem of removing noisy content, such as ads, from a webpage and extracting the relevant content that can be used by various services. This is useful for several web browser features, such as ad blocking, accessibility tools (for example, read out loud), translation, and summarization. Creating a training dataset for boilerplate detection and removal by manually labeling or tagging webpage data is tedious and time-consuming. Hence, a semi-supervised model, in which some of the webpage elements are labeled manually and labels for the others are inferred based on some parameters, can be useful. In this paper we present a solution for extracting relevant content from a webpage that relies on semi-supervised learning using Gaussian Random Fields. We first represent the webpage as a graph, with text elements as nodes and edge weights representing the similarity between nodes. We then label a few nodes in the graph using heuristics and label the remaining nodes by a weighted measure of similarity to the already labeled nodes. We describe the system architecture and a few preliminary results on a dataset of webpages.
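The graph-labeling step can be sketched as iterative label propagation, in which each unlabeled node repeatedly takes the similarity-weighted mean of its neighbors while seed nodes stay fixed. This is a minimal sketch of the general Gaussian Random Field idea, not the paper's system; the 4-node graph, its weights, and the function name are hypothetical.

```python
def propagate_labels(weights, labels, iterations=200):
    """Iteratively set each unlabeled node to the weighted mean of its neighbors.

    weights: symmetric matrix (list of lists) of non-negative similarities.
    labels: dict node_index -> 0.0 (boilerplate) or 1.0 (content) for seed nodes.
    Returns one score in [0, 1] per node; seed nodes keep their label.
    """
    n = len(weights)
    scores = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(iterations):
        for i in range(n):
            if i in labels:
                continue  # labeled nodes are clamped
            total = sum(weights[i][j] for j in range(n) if j != i)
            if total > 0:
                scores[i] = sum(
                    weights[i][j] * scores[j] for j in range(n) if j != i
                ) / total
    return scores

# Hypothetical 4-node page graph: nodes 0 and 3 are labeled by heuristics.
W = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.9],
    [0.0, 0.1, 0.9, 0.0],
]
scores = propagate_labels(W, {0: 1.0, 3: 0.0})
```

Node 1 is strongly tied to the content seed and ends up with a score above 0.5, while node 2 is tied to the boilerplate seed and ends up below 0.5.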
Machine Learning: What it is and Why it Matters
Machine Learning has begun to reshape how we live, so we need to understand what Machine Learning is and why it matters. A good starting point for a definition is that Machine Learning (ML) is a core sub-area of Artificial Intelligence (AI). ML applications learn from experience (well, data), as humans do, without direct programming. When exposed to new data, these applications learn, grow, change, and develop by themselves. In other words, with Machine Learning, computers find insightful information without being told where to look.
5 Types of Machine Learning Algorithms You Should Know
If you're a beginner, machine learning can be confusing: how do you choose which algorithms to use from the apparently limitless options, and how do you know which one will provide the right predictions (data outputs)? Machine learning is a way for computers to run various algorithms without direct human oversight in order to learn from data. So, before starting with machine learning algorithms, let's look at the types of machine learning that help categorize them. Machine learning algorithms are programs that can learn from data and improve from experience without human intervention. Learning tasks may include learning the function that maps the input to the output; learning the hidden structure in unlabeled data; or 'instance-based learning', where a class label is produced for a new instance by comparing the new instance (row) to instances from the training data, which were stored in memory. A machine learning algorithm is an evolution of the regular algorithm.
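As an illustration of instance-based learning, here is a minimal k-nearest-neighbors sketch: the training rows are simply kept in memory, and a new row receives the majority label of its closest stored rows. The data and the function name are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, new_row, k=3):
    """Label new_row by majority vote among its k nearest stored instances."""
    distances = sorted(
        (math.dist(row, new_row), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

rows = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "a", "b", "b"]
prediction = knn_predict(rows, labels, (1.1, 1.0))
```

No model is fitted ahead of time; all the work happens when a new instance arrives.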
Fresh from the arXiv: Oct 21 to 25
This episode of Fresh from the arXiv is going to be a little different. Normally I skim through all of the AI, computer vision and NLP preprints that came out during the week and pick a few that I consider particularly interesting. Often there is a common theme uniting a few of my choices, but the idea is not really to zoom in on any particular subject. Last week, however, I could not help but fall down a rabbit hole called semi-supervised learning with GANs. I ended up putting together a short introduction to the topic that is not too technical (meaning it should be understandable to anyone with a vague idea of how a vanilla unsupervised GAN operates), but that also provides a few directions to explore in more detail should you be interested.
Auto-Annotation Quality Prediction for Semi-Supervised Learning with Ensembles
Simon, Dror, Farber, Miriam, Goldenberg, Roman
Auto-annotation by an ensemble of models is an efficient method of learning on unlabeled data. Wrong or inaccurate annotations generated by the ensemble may lead to performance degradation of the trained model. To deal with this problem, we propose filtering the auto-labeled data using a trained model that predicts the quality of the annotation from the degree of consensus between ensemble models. Using semantic segmentation as an example, we show the advantage of the proposed auto-annotation filtering over training on data contaminated with inaccurate labels. Moreover, our experimental results show that in the case of semantic segmentation, the performance of a state-of-the-art model can be achieved by training it with only a fraction (30%) of the original manually labeled data set and replacing the rest with the auto-annotated, quality-filtered labels.
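The filtering idea can be sketched with a deliberately simplified stand-in: instead of a trained quality predictor, the example below uses the raw fraction of ensemble members that agree and keeps a sample only if that fraction reaches a threshold. The threshold, sample identifiers, and function names are hypothetical, and per-image labels stand in for segmentation maps.

```python
from collections import Counter

def consensus_label(predictions):
    """Return the majority label and the fraction of ensemble members that agree."""
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)

def filter_auto_labels(ensemble_outputs, min_agreement=0.8):
    """Keep only samples whose ensemble agreement reaches min_agreement."""
    kept = {}
    for sample_id, predictions in ensemble_outputs.items():
        label, agreement = consensus_label(predictions)
        if agreement >= min_agreement:
            kept[sample_id] = label
    return kept

outputs = {
    "img_1": ["road", "road", "road", "road", "road"],
    "img_2": ["road", "car", "sky", "road", "car"],
    "img_3": ["car", "car", "car", "car", "road"],
}
kept = filter_auto_labels(outputs)
```

Here `img_2` is dropped because only two of the five models agree, while the other two samples keep their majority label.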
A Unified Framework for Data Poisoning Attack to Graph-based Semi-supervised Learning
Liu, Xuanqing, Si, Si, Zhu, Xiaojin, Li, Yang, Hsieh, Cho-Jui
In this paper, we propose a general framework for data poisoning attacks on graph-based semi-supervised learning (G-SSL). In this framework, we first unify different tasks, goals, and constraints into a single formula for data poisoning attacks on G-SSL. We then propose two specialized algorithms to efficiently solve two important cases: poisoning regression tasks under an $\ell_2$-norm constraint and classification tasks under an $\ell_0$-norm constraint. In the former case, we transform the problem into a non-convex trust region problem and show that our gradient-based algorithm, with a delicate initialization and update scheme, finds the (globally) optimal perturbation. The latter case is an NP-hard integer programming problem, for which we propose a probabilistic solver that works much better than the classical greedy method. Lastly, we test our framework on real datasets and evaluate the robustness of G-SSL algorithms. For instance, on the MNIST binary classification problem (50,000 training examples, 50 of them labeled), flipping the labels of two labeled examples is enough to make the model perform like random guessing (around 50% error).
Investigating Under and Overfitting in Wasserstein Generative Adversarial Networks
Adlam, Ben, Weill, Charles, Kapoor, Amol
We investigate under- and overfitting in Generative Adversarial Networks (GANs), using discriminators unseen by the generator to measure generalization. We find that the model capacity of the discriminator has a significant effect on the generator's model quality, and that the generator's poor performance coincides with the discriminator underfitting. Contrary to our expectations, we find that generators with large model capacities relative to the discriminator do not show evidence of overfitting on CIFAR10, CIFAR100, and CelebA.
Generalized Matrix Means for Semi-Supervised Learning with Multilayer Graphs
Mercado, Pedro, Tudisco, Francesco, Hein, Matthias
We study the task of semi-supervised learning on multilayer graphs by taking into account both labeled and unlabeled observations together with the information encoded by each individual graph layer. We propose a regularizer based on the generalized matrix mean, which is a one-parameter family of matrix means that includes the arithmetic, geometric and harmonic means as particular cases. We analyze it in expectation under a Multilayer Stochastic Block Model and verify numerically that it outperforms state-of-the-art methods. Moreover, we introduce a matrix-free numerical scheme based on contour integral quadratures and Krylov subspace solvers that scales to large sparse multilayer graphs.
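For orientation, one common way to write a one-parameter power mean of $T$ symmetric positive definite matrices $A_1, \dots, A_T$ is given below. The abstract does not spell out the formula or say which matrices are averaged, so the notation here is an assumption rather than the paper's exact definition.

```latex
M_p(A_1, \dots, A_T) = \left( \frac{1}{T} \sum_{i=1}^{T} A_i^{\,p} \right)^{1/p},
\qquad p \in \mathbb{R} \setminus \{0\}
```

Under this definition, $p = 1$ gives the arithmetic mean, $p = -1$ gives the harmonic mean, and the limit $p \to 0$ gives a geometric-type (log-Euclidean) mean.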