Supervised Learning: Instructional Materials

A Gentle Introduction to Vector Space Models


A vector space model considers the relationships between data that are represented as vectors. It is popular in information retrieval systems but is also useful for other purposes. Generally, it allows us to compare the similarity of two vectors from a geometric perspective. In this tutorial, we will see what a vector space model is and what it can do.
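One common way to compare two vectors geometrically is the cosine of the angle between them. The sketch below is only an illustration of that idea, not code from the tutorial: the function name `cosine_similarity` and the toy term-count vectors are assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical term-count vectors over the vocabulary ["cat", "dog", "fish"].
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same direction as doc_a, twice the length
doc_c = [0, 0, 3]  # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))  # close to 1: same direction
print(cosine_similarity(doc_a, doc_c))  # 0: orthogonal vectors
```

Because the measure depends only on direction, a document and a longer document with the same term proportions are treated as maximally similar.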

Conical Classification For Computationally Efficient One-Class Topic Determination Artificial Intelligence

As the Internet grows in size, so does the amount of text-based information that exists. For many application spaces it is paramount to isolate and identify texts that relate to a particular topic. While one-class classification would be ideal for such analysis, there is a relative lack of research regarding efficient approaches with high predictive power. By noting that the range of documents we wish to identify can be represented as positive linear combinations of the Vector Space Model representing our text, we propose Conical classification, an approach that allows us to identify if a document is of a particular topic in a computationally efficient manner. We also propose Normal Exclusion, a modified version of Bi-Normal Separation that makes it more suitable within the one-class classification context. We show in our analysis that our approach not only has higher predictive power on our datasets, but is also faster to compute.

Multi-Task Learning based Online Dialogic Instruction Detection with Pre-trained Language Models Artificial Intelligence

In this work, we study computational approaches to detect online dialogic instructions, which are widely used to help students understand learning materials, and build effective study habits. This task is rather challenging due to the widely-varying quality and pedagogical styles of dialogic instructions. To address these challenges, we utilize pre-trained language models, and propose a multi-task paradigm which enhances the ability to distinguish instances of different classes by enlarging the margin between categories via contrastive loss. Furthermore, we design a strategy to fully exploit the misclassified examples during the training stage. Extensive experiments on a real-world online educational data set demonstrate that our approach achieves superior performance compared to representative baselines.

Modeling Pipeline Optimization With scikit-learn


This tutorial presents two essential concepts in data science and automated learning. One is the machine learning pipeline, and the second is its optimization. These two principles are the key to implementing any successful intelligent system based on machine learning. A machine learning pipeline can be created by putting together a sequence of steps involved in training a machine learning model. It can be used to automate a machine learning workflow.
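As a rough illustration of both ideas, the sketch below chains two steps into a scikit-learn `Pipeline` and then optimizes one hyperparameter with `GridSearchCV`. The choice of dataset (iris), the two steps, and the parameter grid are assumptions made for this example and are not taken from the tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The pipeline: each step is a (name, estimator) pair, applied in order.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The optimization: parameters of a step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Wrapping the scaler inside the pipeline means it is refitted on each cross-validation training fold, so no information from the validation fold leaks into preprocessing.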

Knowledge Consolidation based Class Incremental Online Learning with Limited Data Artificial Intelligence

We propose a novel approach for class incremental online learning in a limited data setting. This problem setting is challenging because of the following constraints: (1) Classes are given incrementally, which necessitates a class incremental learning approach; (2) Data for each class is given in an online fashion, i.e., each training example is seen only once during training; (3) Each class has very few training examples; and (4) We do not use or assume access to any replay/memory to store data from previous classes. Therefore, in this setting, we have to handle the twofold problem of catastrophic forgetting and overfitting. In our approach, we learn robust representations that generalize across tasks, without suffering from catastrophic forgetting or overfitting, so that future classes with limited samples can be accommodated. Our proposed method leverages the meta-learning framework with knowledge consolidation. The meta-learning framework helps the model learn rapidly when samples appear in an online fashion. Simultaneously, knowledge consolidation helps to learn a representation that is robust against forgetting under online updates, which facilitates future learning. Our approach significantly outperforms other methods on several benchmarks.

Open-world Machine Learning: Applications, Challenges, and Opportunities Artificial Intelligence

Traditional machine learning, especially supervised learning, follows the assumptions of closed-world learning, i.e., for each testing class a training class is available. However, such machine learning models fail to identify classes that were not available during training. These classes can be referred to as unseen classes. Open-world machine learning, in contrast, deals with arbitrary inputs (data with unseen classes) to machine learning systems. Moreover, traditional machine learning is static learning, which is not appropriate for an active environment where the perspective, sources, and/or volume of data are changing rapidly. In this paper, we first present an overview of open-world learning and its importance in real-world contexts. Next, different dimensions of open-world learning are explored and discussed. The area of open-world learning has gained the attention of the research community only in the last decade. We have searched through different online digital libraries and scrutinized the work done in the last decade. This paper presents a systematic review of various techniques for open-world machine learning. It also presents the research gaps, challenges, and future directions in open-world learning. This paper will help researchers to understand the developments in open-world learning and the possibilities for extending the research in suitable areas. It will also help them select applicable methodologies and datasets to explore this further.

What is Focal Loss and when should you use it?


In this blog post we will understand what Focal Loss is and when it is used. We will also take a dive into the math and implement it in PyTorch.

Where was Focal Loss introduced and what was it used for?

Before understanding what Focal Loss is and all the details about it, let’s first quickly get an intuitive understanding of what Focal Loss actually does. Focal Loss was introduced in the paper Focal Loss for Dense Object Detection by Lin et al. For years before this paper, object detection was considered a very difficult problem to solve, and detecting small objects inside images was considered especially hard. See the example below, where the model doesn’t predict anything for the motorbike, which is relatively small compared to the other objects in the image.

[fig-1: predictions of a model trained with Binary Cross Entropy loss; the motorbike is not detected]

The reason the bike in the image above is not predicted is that this model was trained using Binary Cross Entropy loss, which really asks the model to be confident about what it is predicting. Focal Loss, by contrast, makes it easier for the model to predict things without being 80-100% sure that this object is “something”. In simple words, it gives the model a bit more freedom to take some risk when making predictions. This is particularly important when dealing with highly imbalanced datasets because in some cases (such as cancer detection), we really need the model to take a risk and predict something even if the prediction turns out to be a False Positive. Therefore, Focal Loss is particularly useful in cases where there is a class imbalance. Another example is object detection, where most pixels are usually background and only very few pixels inside an image contain the object of interest.
OK - so Focal Loss was introduced in 2017, and is pretty helpful in dealing with class imbalance - great! By the way, here are the predictions of the same model when trained with Focal Loss.

[fig-2: predictions of the same model trained with Focal Loss]

This might be a good time to actually analyse the two and observe the differences. This will help get an intuitive understanding about Focal Loss.

So, why did that work? What did Focal Loss do to make it work?

Now that we have seen an example of what Focal Loss can do, let’s try and understand why that worked. The most important bit to understand about Focal Loss is the graph below:

[fig-3: Compare FL with CE; loss plotted against the probability of the ground truth class]

In the graph above, the “blue” line represents the Cross Entropy Loss. The X-axis, or ‘probability of ground truth class’ (let’s call it pt for simplicity), is the probability that the model predicts for the ground truth class. As an example, let’s say the model predicts that something is a bike with probability 0.6 and it actually is a bike. Then in this case pt is 0.6. Now consider the same example, but this time the object is not a bike. Then pt is 0.4, because the ground truth here is 0 and the probability that the object is not a bike is 0.4 (1-0.6). The Y-axis is simply the loss value given pt.

As can be seen from the image, when the model predicts the ground truth with a probability of 0.6, the Cross Entropy Loss is still somewhere around 0.5. Therefore, to reduce the loss, our model would have to predict the ground truth label with a much higher probability. In other words, Cross Entropy Loss asks the model to be very confident about the ground truth prediction. This in turn can actually impact performance negatively: the Deep Learning model can become overconfident, and an overconfident model wouldn’t generalize well. This problem of overconfidence is also highlighted in the excellent paper Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration.
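The pt bookkeeping and the "0.6 gives a loss of about 0.5" observation above can be checked with a few lines of plain Python. This is a minimal sketch; the helper names `prob_of_ground_truth` and `cross_entropy` are made up for this illustration and do not come from the blog post or from any library.

```python
import math


def prob_of_ground_truth(p, y):
    """Return pt: p if the true label y is 1, otherwise 1 - p."""
    return p if y == 1 else 1.0 - p


def cross_entropy(pt):
    """Cross Entropy Loss for one example, written in terms of pt."""
    return -math.log(pt)


# The model predicts "bike" with probability 0.6.
pt_bike = prob_of_ground_truth(0.6, 1)      # the object really is a bike
pt_not_bike = prob_of_ground_truth(0.6, 0)  # the object is not a bike

print(pt_bike, cross_entropy(pt_bike))
print(pt_not_bike, cross_entropy(pt_not_bike))
```

The first loss printed is the value read off the blue curve at pt = 0.6; the second shows how quickly the loss grows once pt drops below 0.5.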
Label Smoothing, which was introduced as part of Rethinking the Inception Architecture for Computer Vision, is another way to deal with the overconfidence problem. Focal Loss is different from the above-mentioned solutions. As can be seen from the graph Compare FL with CE, using Focal Loss with γ>0 shrinks the loss for “well-classified examples” (examples where the model predicts the right thing with probability > 0.5) far more than it shrinks the loss for “hard-to-classify examples” (where the model predicts the right thing with probability < 0.5), so the hard examples contribute relatively more to the total loss. Therefore, it turns the model’s attention towards the rare class in case of class imbalance.

The Focal Loss is mathematically defined as:

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \qquad \text{(eq-1)}$$

Scary? It’s actually quite intuitive - read on :)

Alpha and Gamma?

So, what the hell are these alpha and gamma in Focal Loss? Also, we will now represent alpha as α and gamma as γ. Here is my understanding from fig-3: γ controls the shape of the curve. The higher the value of γ, the lower the loss for well-classified examples, so we could turn the attention of the model more towards ‘hard-to-classify’ examples. Having higher γ extends the range in which an example receives low loss.

Also, when γ=0, this equation is equivalent to Cross Entropy Loss. How? Well, for the mathematically inclined, Cross Entropy Loss is defined as:

$$CE(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases} \qquad \text{(eq-2)}$$

After some refactoring and defining pt as below:

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \qquad \text{(eq-3)}$$

Putting eq-3 in eq-2, our Cross Entropy Loss therefore becomes:

$$CE(p, y) = CE(p_t) = -\log(p_t) \qquad \text{(eq-4)}$$

Therefore, at γ=0, eq-1 becomes equivalent to eq-4 (up to the weight α_t); that is, Focal Loss becomes equivalent to Cross Entropy Loss. Here is an excellent blogpost that explains Cross Entropy Loss.

Ok, great! So now we know what γ does, but what does α do? Another way, apart from Focal Loss, to deal with class imbalance is to introduce weights: give high weights to the rare class and small weights to the dominating or common class. These weights are referred to as α.
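As a quick numeric check of the γ behaviour described above (setting α aside for the moment), the sketch below evaluates the unweighted focal term (1 − pt)**γ · (−log pt) for an easy, a borderline, and a hard example. The function name `focal_loss` and the chosen pt values are assumptions made for this illustration.

```python
import math


def focal_loss(pt, gamma=2.0):
    """Unweighted focal loss for one example: (1 - pt)**gamma * -log(pt)."""
    return (1.0 - pt) ** gamma * -math.log(pt)


for pt in (0.9, 0.6, 0.1):
    ce = focal_loss(pt, gamma=0.0)  # gamma = 0 leaves plain Cross Entropy
    fl = focal_loss(pt, gamma=2.0)
    print(f"pt={pt:.1f}  CE={ce:.4f}  FL(gamma=2)={fl:.4f}  FL/CE={fl / ce:.2f}")
```

The FL/CE column is just (1 − pt)**γ: it is close to zero for the well-classified example and close to one for the hard example, which is how the hard examples end up carrying most of the loss.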
Adding these weights does help with class imbalance. However, the focal loss paper reports:

“The large class imbalance encountered during training of dense detectors overwhelms the cross entropy loss. Easily classified negatives comprise the majority of the loss and dominate the gradient. While α balances the importance of positive/negative examples, it does not differentiate between easy/hard examples.”

What the authors are trying to explain is this: even when we add α, while it does add different weights to different classes, thereby balancing the importance of positive/negative examples, just doing this is in most cases not enough. What we also want to do is reduce the loss of easily-classified examples, because otherwise these easily-classified examples would dominate our training.

So, how does Focal Loss deal with this? It adds a multiplicative factor to Cross Entropy Loss, and this multiplicative factor is (1 − pt)**γ, where pt, as you remember, is the probability of the ground truth label. From the paper for Focal Loss:

“We propose to add a modulating factor (1 − pt)**γ to the cross entropy loss, with tunable focusing parameter γ ≥ 0.”

Really? Is that all that the authors have done? That is, to multiply Cross Entropy Loss by (1 − pt)**γ? Yes!! Remember eq-4?

How to implement this in code?

While TensorFlow provides this loss function here, it is not inherently supported by PyTorch, so we have to write a custom loss function.
Here is the implementation of Focal Loss in PyTorch:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WeightedFocalLoss(nn.Module):
        "Weighted version of Focal Loss"
        def __init__(self, alpha=.25, gamma=2):
            super().__init__()
            # alpha[0] weights targets equal to 0, alpha[1] weights targets equal to 1.
            # Registered as a buffer so it follows the module's device instead of a hard-coded .cuda().
            self.register_buffer("alpha", torch.tensor([alpha, 1 - alpha]))
            self.gamma = gamma

        def forward(self, inputs, targets):
            # inputs are raw logits; targets are 0/1 floats of the same shape
            inputs, targets = inputs.reshape(-1), targets.reshape(-1)
            BCE_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction='none')
            targets = targets.type(torch.long)
            at = self.alpha.gather(0, targets)  # per-example class weight
            pt = torch.exp(-BCE_loss)           # probability of the ground truth class
            F_loss = at * (1 - pt)**self.gamma * BCE_loss
            return F_loss.mean()

If you’ve understood the meaning of alpha and gamma then this implementation should also make sense, because, similar to the paper, it simply multiplies BCE_loss (the Binary Cross Entropy Loss) by a factor of at*(1-pt)**self.gamma.

Credits

Please feel free to let me know via twitter if you did end up trying Focal Loss after reading this and whether you did see an improvement in your results! Thanks for reading! The implementation of Focal Loss has been adapted from here. fig-1 and fig-2 are from the Fastai 2018 course Lecture-09!
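One detail of the implementation above that may not be obvious is the line `pt = torch.exp(-BCE_loss)`. Since Binary Cross Entropy equals −log(pt), exponentiating its negative recovers pt directly, with no need to branch on the target. The plain-Python sketch below checks that identity for a few made-up logits; the helper names are assumptions for this illustration and are not part of PyTorch.

```python
import math


def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))


def bce_with_logits(logit, target):
    """Binary cross entropy for one example, computed from a raw logit."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def pt_from_logit(logit, target):
    """Probability the model assigns to the ground truth class."""
    p = sigmoid(logit)
    return p if target == 1 else 1.0 - p


# Hypothetical (logit, target) pairs.
examples = [(2.0, 1), (2.0, 0), (-0.5, 1)]
for logit, target in examples:
    bce = bce_with_logits(logit, target)
    print(target, round(math.exp(-bce), 6), round(pt_from_logit(logit, target), 6))
```

The last two columns agree on every row, which is why the PyTorch version can compute pt from the unreduced loss tensor in a single vectorized call.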

Multiple instance active learning for object detection Artificial Intelligence

Despite the substantial progress of active learning for image recognition, an instance-level active learning method designed specifically for object detection is still lacking. In this paper, we propose Multiple Instance Active Object Detection (MI-AOD), to select the most informative images for detector training by observing instance-level uncertainty. MI-AOD defines an instance uncertainty learning module, which leverages the discrepancy of two adversarial instance classifiers trained on the labeled set to predict instance uncertainty of the unlabeled set. MI-AOD treats unlabeled images as instance bags and feature anchors in images as instances, and estimates the image uncertainty by re-weighting instances in a multiple instance learning (MIL) fashion. Iterative instance uncertainty learning and re-weighting facilitate suppressing noisy instances, toward bridging the gap between instance uncertainty and image-level uncertainty. Experiments validate that MI-AOD sets a solid baseline for instance-level active learning. On commonly used object detection datasets, MI-AOD outperforms state-of-the-art methods with significant margins, particularly when the labeled sets are small. Code is available at

Patterns, predictions, and actions: A story about machine learning Machine Learning

This graduate textbook on machine learning tells a story of how patterns in data support predictions and consequential actions. Starting with the foundations of decision making, we cover representation, optimization, and generalization as the constituents of supervised learning. A chapter on datasets as benchmarks examines their histories and scientific bases. Self-contained introductions to causality, the practice of causal inference, sequential decision making, and reinforcement learning equip the reader with concepts and tools to reason about actions and their consequences. Throughout, the text discusses historical context and societal impact. We invite readers from all backgrounds; some experience with probability, calculus, and linear algebra suffices.

Disambiguation of weak supervision with exponential convergence rates Artificial Intelligence

In many applications of machine learning, such as recommender systems, where an input characterizing a user should be matched with a target representing an ordering of a large number of items, accessing fully supervised data is not an option. Instead, one should expect weak information on the target, which could be a list of items previously taken (if items are online courses), watched (if items are plays), etc., by a user characterized by the feature vector. This motivates weakly supervised learning, which aims at learning a mapping from inputs to targets in a setting where tools from supervised learning cannot be applied off the shelf. Recent applications of weakly supervised learning showcase impressive results in solving complex tasks such as action retrieval on instructional videos (Miech et al., 2019), image semantic segmentation (Papandreou et al., 2015), salient object detection (Wang et al., 2017), 3D pose estimation (Dabral et al., 2018), and text-to-speech synthesis (Jia et al., 2018), to name a few. However, those applications of weakly supervised learning are usually based on clever heuristics, and theoretical foundations of learning from weakly supervised data are scarce, especially when compared to the statistical learning literature on supervised learning (Vapnik, 1995; Boucheron et al., 2005; Steinwart and Christmann, 2008). We aim to provide a step in this direction. In this paper, we focus on partial labelling, a popular instance of weak supervision, approached from a structured prediction point of view (Ciliberto et al., 2020). We detail this setup in Section 2. Our contributions are organized as follows.