Inductive Learning
Modeling Pipeline Optimization With scikit-learn
This tutorial presents two essential concepts in data science and automated learning. One is the machine learning pipeline, and the second is its optimization. These two principles are the key to implementing any successful intelligent system based on machine learning. A machine learning pipeline can be created by putting together a sequence of steps involved in training a machine learning model. It can be used to automate a machine learning workflow.
The Importance of Open-Source ML Datasets
'Data is the new oil' is an over-marketed quote, but one that is certainly true when it comes to machine learning (ML). In an ML world dominated by supervised learning techniques, having access to high-quality labeled datasets is essential to advance ML research and practical implementations. However, labeled datasets are computationally expensive to produce and remain a privilege of large companies, which increases the gap between the "haves" and the "have nots" in the ML space. Beyond the impact in the economics of the ML market, access to high-quality datasets is fundamental to advance research in different ML fields. Datasets such as ImageNet were kind of a Sputnik moment (we mean the first artificial satellite) in ML, sparking remarkable breakthroughs in computer vision.
Popular Machine learning algorithms
Machine learning algorithms are programs that can gain from information and improve from experience, without human intercession. Learning tasks may incorporate learning the function that maps the input directly to the output, learning the concealed structure in unlabeled information; or instance-based learning, where a class name is delivered for another instance by looking at the new case (column) to cases from the training information, which were put away in memory. In case you are willing to master these algorithms, join a course to learn machine learning in India and get exposure to machine learning tools, algorithms and their real-time usage. There are three kinds of Machine Learning strategies, i.e. -reinforcement learning, unsupervised learning, and unsupervised learning. It is one of the most famous AI algorithms being used today; this one is a supervised learning algorithm that is utilized for classifying issues. In this algorithm, we split the populace into at least two homogeneous sets dependent on the most noteworthy traits/free factors.
Intrinsically Motivated Self-supervised Learning in Reinforcement Learning
Zhao, Yue, Du, Chenzhuang, Zhao, Hang, Li, Tiejun
In vision-based reinforcement learning (RL) tasks, it is prevalent to assign the auxiliary task with a surrogate self-supervised loss so as to obtain more semantic representations and improve sample efficiency. However, abundant information in self-supervised auxiliary tasks has been disregarded, since the representation learning part and the decision-making part are separated. To sufficiently utilize information in the auxiliary task, we present a simple yet effective idea to employ self-supervised loss as an intrinsic reward, called Intrinsically Motivated Self-Supervised learning in Reinforcement learning (IM-SSR). We formally show that the self-supervised loss can be decomposed as exploration for novel states and robustness improvement from nuisance elimination. IM-SSR can be effortlessly plugged into any reinforcement learning with self-supervised auxiliary objectives with nearly no additional cost. Combined with IM-SSR, the previous underlying algorithms achieve salient improvements on both sample efficiency and generalization in various vision-based robotics tasks from the DeepMind Control Suite, especially when the reward signal is sparse.
Training Data Subset Selection for Regression with Controlled Generalization Error
Sivasubramanian, Durga, Iyer, Rishabh, Ramakrishnan, Ganesh, De, Abir
Data subset selection from a large number of training instances has been a successful approach toward efficient and cost-effective machine learning. However, models trained on a smaller subset may show poor generalization ability. In this paper, our goal is to design an algorithm for selecting a subset of the training data, so that the model can be trained quickly, without significantly sacrificing on accuracy. More specifically, we focus on data subset selection for L2 regularized regression problems and provide a novel problem formulation which seeks to minimize the training loss with respect to both the trainable parameters and the subset of training data, subject to error bounds on the validation set. We tackle this problem using several technical innovations. First, we represent this problem with simplified constraints using the dual of the original training problem and show that the objective of this new representation is a monotone and alpha-submodular function, for a wide variety of modeling choices. Such properties lead us to develop SELCON, an efficient majorization-minimization algorithm for data subset selection, that admits an approximation guarantee even when the training provides an imperfect estimate of the trained model. Finally, our experiments on several datasets show that SELCON trades off accuracy and efficiency more effectively than the current state-of-the-art.
ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences
Gao, Yanjun, Huang, Ting-hao, Passonneau, Rebecca J.
Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves comparable performance as two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.
Credal Self-Supervised Learning
Lienen, Julian, Hรผllermeier, Eyke
Self-training is an effective approach to semi-supervised learning. The key idea is to let the learner itself iteratively generate "pseudo-supervision" for unlabeled instances based on its current hypothesis. In combination with consistency regularization, pseudo-labeling has shown promising performance in various domains, for example in computer vision. To account for the hypothetical nature of the pseudo-labels, these are commonly provided in the form of probability distributions. Still, one may argue that even a probability distribution represents an excessive level of informedness, as it suggests that the learner precisely knows the ground-truth conditional probabilities. In our approach, we therefore allow the learner to label instances in the form of credal sets, that is, sets of (candidate) probability distributions. Thanks to this increased expressiveness, the learner is able to represent uncertainty and a lack of knowledge in a more flexible and more faithful manner. To learn from weakly labeled data of that kind, we leverage methods that have recently been proposed in the realm of so-called superset learning. In an exhaustive empirical evaluation, we compare our methodology to state-of-the-art self-supervision approaches, showing competitive to superior performance especially in low-label scenarios incorporating a high degree of uncertainty.
Recent Deep Semi-supervised Learning Approaches and Related Works
The author of this work proposes an overview of the recent semi-supervised learning approaches and related works. Despite the remarkable success of neural networks in various applications, there exist few formidable constraints including the need for a large amount of labeled data. Therefore, semi-supervised learning, which is a learning scheme in which the scarce labels and a larger amount of unlabeled data are utilized to train models (e.g., deep neural networks) is getting more important. Based on the key assumptions of semi-supervised learning, which are the manifold assumption, cluster assumption, and continuity assumption, the work reviews the recent semi-supervised learning approaches. In particular, the methods in regard to using deep neural networks in a semi-supervised learning setting are primarily discussed. In addition, the existing works are first classified based on the underlying idea and explained, and then the holistic approaches that unify the aforementioned ideas are detailed.
Open-set Label Noise Can Improve Robustness Against Inherent Label Noise
Wei, Hongxin, Tao, Lue, Xie, Renchunzi, An, Bo
Learning with noisy labels is a practically challenging problem in weakly supervised learning. In the existing literature, open-set noises are always considered to be poisonous for generalization, similar to closed-set noises. In this paper, we empirically show that open-set noisy labels can be non-toxic and even benefit the robustness against inherent noisy labels. Inspired by the observations, we propose a simple yet effective regularization by introducing Open-set samples with Dynamic Noisy Labels (ODNL) into training. With ODNL, the extra capacity of the neural network can be largely consumed in a way that does not interfere with learning patterns from clean data. Through the lens of SGD noise, we show that the noises induced by our method are random-direction, conflict-free and biased, which may help the model converge to a flat minimum with superior stability and enforce the model to produce conservative predictions on Out-of-Distribution instances. Extensive experimental results on benchmark datasets with various types of noisy labels demonstrate that the proposed method not only enhances the performance of many existing robust algorithms but also achieves significant improvement on Out-of-Distribution detection tasks even in the label noise setting.