This post is a part of my forthcoming book on Mathematical foundations of Data Science. In this post, we use the Perceptron algorithm to bridge the gap between high school maths and deep learning. As part of my role as course director of the Artificial Intelligence: Cloud and Edge Computing at the University..., I see more students who are familiar with programming than with mathematics. They have last learnt maths years ago at University. And then, suddenly they find that they encounter matrices, linear algebra etc when they start learning Data Science.

Sorry if this is too trivial, but let me start at the "very beginning:" Linear regression. The goal of (ordinary least-squares) linear regression is to find the optimal weights that -- when linearly combined with the inputs -- result in a model that minimizes the vertical offsets between the target and explanatory variables, but let's not get distracted by model fitting, which is a different topic;). So, in linear regression, we compute a linear combination of weights and inputs (let's call this function the "net input function"). Next, let's consider logistic regression. Here, we put the net input z through a non-linear "activation function" -- the logistic sigmoid function where.

Cesa-bianchi, Nicolò, Gentile, Claudio, Zaniboni, Luca

We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently runin any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, butusing much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results:Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels.