Inductive Learning
Self-organized inductive reasoning with NeMuS
Barreto, Leonardo, Mota, Edjard
Neural Multi-Space (NeMuS) is a weighted multispace representation for a portion of first-order logic designed for use with machine learning and neural network methods. It was demonstrated that it can be used to perform reasoning based on regions forming patterns of refutation and also in the process of inductive learning in ILP-like style. In this direction, patterns of concepts can be used to justify (and explain) "shortcuts" to generate recursive hypothesis from very large sets of relations without the need to compute the entire path to justify it. This is critical when the background knowledge has huge amounts of data. It could be adequately handled as regions of concepts and categories, similar to the human brain map organization. This will allow symbolic deduction ...
Label-less supervised learning? Enter self-supervised learning.
High-capacity networks are solving many different machine learning tasks, ranging from large-scale image classification, segmentation and image generation, to natural speech understanding and realistic text-to-speech, arguably passing some formulations of a Turing Test. A few general trends are easily identified in academia and industry: deeper networks show increasingly better results, as long as they are fed with ever larger amounts of data, and labelled data in particular. Computational and economic costs increase linearly with the size of the dataset, and for this reason a number of unsupervised approaches aiming to exploit unlabelled data have been growing in popularity since 2015. The intuition behind many of these techniques is to emulate the ability of human brains to self-determine the goal of a task and improve towards it. Over the same period, advances in algorithms able to exploit labels inherently contained within an unlabelled dataset gave rise to what is now referred to as self-supervised learning.
Manifold Graph with Learned Prototypes for Semi-Supervised Image Classification
Kuo, Chia-Wen, Ma, Chih-Yao, Huang, Jia-Bin, Kira, Zsolt
Recent advances in semi-supervised learning methods rely on estimating the categories of unlabeled data using a model trained on the labeled data (pseudo-labeling) and using the unlabeled data for various consistency-based regularization. In this work, we propose to explicitly leverage the structure of the data manifold based on a Manifold Graph constructed over the image instances within the feature space. Specifically, we propose an architecture based on graph networks that jointly optimizes feature extraction, graph connectivity, and feature propagation and aggregation to unlabeled data in an end-to-end manner. Further, we present a novel Prototype Generator for producing a diverse set of prototypes that compactly represent each category, which supports feature propagation. To evaluate our method, we first contribute a strong baseline that combines two consistency-based regularizers and already achieves state-of-the-art results, especially with fewer labels. We then show that when combined with these regularizers, the proposed method facilitates the propagation of information from generated prototypes to image data to further improve results. We provide extensive qualitative and quantitative experimental results on semi-supervised benchmarks demonstrating the improvements arising from our design and show that our method achieves state-of-the-art performance when compared with existing methods using a single model and is comparable with ensemble methods. Specifically, we achieve error rates of 3.35% on SVHN, 8.27% on CIFAR-10, and 33.83% on CIFAR-100. With far fewer labels, we surpass the state of the art by a significant margin of 41% relative error decrease on average.
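The pseudo-labeling step mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's method: the function name `pseudo_label`, the confidence threshold of 0.9, and the probability values are all assumptions chosen for illustration. The idea shown is only that a model trained on labeled data scores unlabeled samples, and sufficiently confident predictions are kept as labels.

```python
def pseudo_label(probabilities, threshold=0.9):
    """Return (sample_index, label) pairs for confident predictions.

    `probabilities` holds one list of class probabilities per unlabeled
    sample. The threshold is an assumed hyperparameter, not taken from
    the abstract.
    """
    selected = []
    for i, probs in enumerate(probabilities):
        # Most likely class for this sample.
        label = max(range(len(probs)), key=probs.__getitem__)
        # Keep the sample only if the model is confident enough.
        if probs[label] >= threshold:
            selected.append((i, label))
    return selected


# Made-up model outputs for three unlabeled samples (illustrative only).
unlabeled_probs = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.05, 0.05, 0.90],
]
selected = pseudo_label(unlabeled_probs)
```

Here the second sample is left unlabeled because its top probability falls below the assumed threshold.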
New DeepMind Unsupervised Image Model Challenges AlexNet
While supervised learning has tremendously improved AI performance in image classification, a major drawback is its reliance on large-scale labeled datasets. This has prompted researchers to explore the potential of unsupervised learning and semi-supervised learning -- techniques that forego data annotation but have their own drawback: diminished accuracy. A new paper from Google's UK-based research company DeepMind addresses this with a model based on Contrastive Predictive Coding (CPC) that outperforms the fully-supervised AlexNet model in Top-1 and Top-5 accuracy on ImageNet. CPC was introduced by DeepMind in 2018. The unsupervised learning approach uses a powerful autoregressive model to extract representations of high-dimensional data to predict future samples.
A cost-reducing partial labeling estimator in text classification problem
Chen, Jiangning, Dai, Zhibo, Duan, Juntao, Hu, Qianli, Li, Ruilin, Matzinger, Heinrich, Popescu, Ionel, Zhai, Haoyan
We propose a new approach to address text classification problems when learning with partial labels is beneficial. Instead of offering each training sample a set of candidate labels, we assign negative-oriented labels to the ambiguous training examples if they are unlikely to fall into certain classes. We construct our new maximum likelihood estimators with a self-correction property, and prove that under some conditions, our estimators converge faster. We also discuss the advantages of applying one of our estimators to a fully supervised learning problem. The proposed method has potential applicability in many areas, such as crowdsourcing, natural language processing and medical image analysis.
Selfie: Self-supervised Pretraining for Image Embedding
Trinh, Trieu H., Luong, Minh-Thang, Le, Quoc V.
We introduce a pretraining technique called Selfie, which stands for SELF-supervised Image Embedding. Selfie generalizes the concept of masked language modeling to continuous data, such as images. Given masked-out patches in an input image, our method learns to select the correct patch, among other "distractor" patches sampled from the same image, to fill in the masked location. This classification objective sidesteps the need for predicting exact pixel values of the target patches. The pretraining architecture includes a network of convolutional blocks to process patches followed by an attention pooling network to summarize the content of unmasked patches before predicting masked ones. During finetuning, we reuse the convolutional weights found by pretraining. We evaluate our method on three benchmarks (CIFAR-10, ImageNet 32 x 32, and ImageNet 224 x 224) with varying amounts of labeled data, from 5% to 100% of the training sets. Our pretraining method provides consistent improvements to ResNet-50 across all settings compared to the standard supervised training of the same network. Notably, on ImageNet 224 x 224 with 60 examples per class (5%), our method improves the mean accuracy of ResNet-50 from 35.6% to 46.7%, an improvement of 11.1 points in absolute accuracy. Our pretraining method also improves ResNet-50 training stability, especially in the low-data regime, by significantly lowering the standard deviation of test accuracies across datasets.
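The patch-selection objective described here can be sketched as a softmax classification over candidate patches. This is a simplified illustration, not the Selfie architecture: the dot-product scoring, the function name `patch_selection_loss`, and the toy two-dimensional embeddings are assumptions. In the paper the query would come from the attention pooling network and the candidates from the convolutional patch network.

```python
import math


def patch_selection_loss(query, candidates, target_index):
    """Cross-entropy for picking the true patch among distractors.

    `query` summarizes the unmasked context; `candidates` are patch
    embeddings (the true patch plus distractors). Scoring by dot
    product is an assumption made for this sketch.
    """
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    # Numerically stable log-softmax over the candidate scores.
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    log_probs = [s - log_norm for s in scores]
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return -log_probs[target_index], predicted


# Toy embeddings (hypothetical values): candidate 0 is the true patch.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.1, 0.9], [-0.5, 0.2]]
loss, predicted = patch_selection_loss(query, candidates, target_index=0)
```

Because the loss depends only on which candidate is chosen, no pixel values need to be reconstructed, which is the point the abstract makes about sidestepping exact pixel prediction.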
Rectifying Classifier Chains for Multi-Label Classification
Senge, Robin, del Coz, Juan José, Hüllermeier, Eyke
Classifier chains have recently been proposed as an appealing method for tackling the multi-label classification task. In addition to several empirical studies showing its state-of-the-art performance, especially when being used in its ensemble variant, there are also some first results on theoretical properties of classifier chains. Continuing along this line, we analyze the influence of a potential pitfall of the learning process, namely the discrepancy between the feature spaces used in training and testing: While true class labels are used as supplementary attributes for training the binary models along the chain, the same models need to rely on estimations of these labels at prediction time. We elucidate under which circumstances the attribute noise thus created can affect the overall prediction performance. As a result of our findings, we propose two modifications of classifier chains that are meant to overcome this problem. Experimentally, we show that our variants are indeed able to produce better results in cases where the original chaining process is likely to fail.
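The train/test discrepancy discussed in this abstract can be made concrete with a small sketch. It does not implement the paper's proposed modifications; the helper names `chain_training_inputs` and `chain_predict`, the toy data, and the hand-written models are assumptions used only to show where true labels and estimated labels enter the chain.

```python
def chain_training_inputs(X, Y):
    """Build the training inputs for each binary model in the chain.

    The model for label j sees the original features plus the TRUE
    values of labels 0..j-1.
    """
    n_labels = len(Y[0])
    return [
        [list(x) + list(y[:j]) for x, y in zip(X, Y)]
        for j in range(n_labels)
    ]


def chain_predict(models, x):
    """Predict all labels for one instance.

    Each model sees the ESTIMATED labels of its predecessors, so an
    early mistake becomes attribute noise for every later model.
    """
    predicted = []
    for model in models:
        predicted.append(model(list(x) + predicted))
    return predicted


# Toy data (hypothetical): one feature, two labels that always agree.
X = [[0.2], [0.8]]
Y = [[0, 0], [1, 1]]
inputs = chain_training_inputs(X, Y)

# Hand-written stand-ins for trained models: the first thresholds the
# feature, the second simply copies the previous label.
models = [lambda f: int(f[0] > 0.5), lambda f: f[1]]
prediction = chain_predict(models, [0.45])
```

If the first model misclassifies a borderline instance, the second model inherits that error, because it learned to rely on a label that was always correct during training.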
Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm
Spigler, Stefano, Geiger, Mario, Wyart, Matthieu
How many training data are needed to learn a supervised task? It is often observed that the generalization error decreases as $n^{-\beta}$ where $n$ is the number of training examples and $\beta$ an exponent that depends on both data and algorithm. In this work we measure $\beta$ when applying kernel methods to real datasets. For MNIST we find $\beta\approx 0.4$ and for CIFAR10 $\beta\approx 0.1$. Remarkably, $\beta$ is the same for regression and classification tasks, and for Gaussian or Laplace kernels. To rationalize the existence of non-trivial exponents that can be independent of the specific kernel used, we introduce the Teacher-Student framework for kernels. In this scheme, a Teacher generates data according to a Gaussian random field, and a Student learns them via kernel regression. With a simplifying assumption --- namely that the data are sampled from a regular lattice --- we derive analytically $\beta$ for translation invariant kernels, using previous results from the kriging literature. Provided that the Student is not too sensitive to high frequencies, $\beta$ depends only on the training data and their dimension. We confirm numerically that these predictions hold when the training points are sampled at random on a hypersphere. Overall, our results quantify how smooth Gaussian data should be to avoid the curse of dimensionality, and indicate that for kernel learning the relevant dimension of the data should be defined in terms of how the distance between nearest data points depends on $n$. With this definition one obtains reasonable effective smoothness estimates for MNIST and CIFAR10.
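As an illustration of how an exponent such as $\beta$ can be measured, the sketch below fits a power law $\epsilon(n) = c\,n^{-\beta}$ by least squares in log-log space. This is a generic fitting procedure assumed for illustration, not necessarily the one the authors used, and the data are synthetic values generated from a known power law rather than measurements from the paper.

```python
import math


def fit_learning_curve_exponent(sizes, errors):
    """Fit log(error) = log(c) - beta * log(n) by ordinary least squares.

    Returns (beta, c).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # The slope of the log-log line is -beta.
    return -slope, math.exp(intercept)


# Synthetic data drawn from error = 2.0 * n ** -0.4 (illustrative only).
sizes = [100, 200, 400, 800, 1600]
errors = [2.0 * n ** -0.4 for n in sizes]
beta, c = fit_learning_curve_exponent(sizes, errors)
```

On real learning curves the fit is only meaningful over the range of $n$ where the error actually follows a power law, so the choice of that range is itself a modelling decision.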
Machine Learning in R for beginners
You see that the model makes reasonably accurate predictions, with the exception of one wrong classification in row 29, where "Versicolor" was predicted while the test label is "Virginica". This is already some indication of your model's performance, but you might want to go even deeper into your analysis.