In this post I look at the popular gradient boosting algorithm XGBoost and show how to apply CUDA and parallel algorithms to greatly decrease training times in decision tree algorithms. XGBoost is a supervised learning algorithm: it takes a set of labelled training instances as input and builds a model that aims to correctly predict the label of each training example based on the other, non-label information we know about the example (known as the features of the instance). Figure 1 shows a simple decision tree model (I'll call it "Decision Tree 0") with two decision nodes and three leaves. XGBoost also extends the loss function with penalty terms for adding new decision tree leaves to the model, with the penalty proportional to the size of the leaf weights.
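To make the regularized objective concrete, here is a small sketch (my own illustration, not the library's code) of the closed-form quantities XGBoost derives from the per-instance gradient statistics: the optimal leaf weight \(w^* = -G/(H+\lambda)\) and the gain of a candidate split, where \(\gamma\) is the per-leaf penalty mentioned above:

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w* = -G / (H + lambda) for gradients g and hessians h."""
    return -np.sum(g) / (np.sum(h) + lam)

def split_gain(g, h, mask, lam=1.0, gamma=0.0):
    """Gain of splitting a leaf into left (mask) and right (~mask) children."""
    def score(gs, hs):
        # G^2 / (H + lambda): how much loss this leaf can remove
        return np.sum(gs) ** 2 / (np.sum(hs) + lam)
    gain = 0.5 * (score(g[mask], h[mask]) + score(g[~mask], h[~mask]) - score(g, h))
    return gain - gamma  # gamma penalizes every leaf added to the tree
```

A split is only kept when its gain exceeds zero, which is exactly how the leaf penalty \(\gamma\) prunes small, low-value leaves.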

To ensure the perplexity of each row of \(P\), \(Perp(P_i)\), is equal to our desired perplexity, we simply perform a binary search over each \(\sigma_i\) until \(Perp(P_i)\) matches our desired perplexity. The search takes a matrix of negative euclidean distances and a target perplexity. Let's also define a p_joint function that takes our data matrix \(\textbf{X}\) and returns the matrix of joint probabilities \(P\), estimating the required \(\sigma_i\)'s and the conditional probability matrix along the way. With that, we have our joint distributions \(p\) and \(q\). The only real difference is how we define the joint probability distribution matrix \(Q\), which has entries \(q_{ij}\).
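The binary search over each \(\sigma_i\) can be sketched as follows (a minimal numpy version for a single row; function names are my own, and the self-probability \(p_{i|i}=0\) bookkeeping is omitted for brevity). Perplexity is monotone in \(\sigma\), so bisection is enough: a larger \(\sigma\) flattens the row and raises the perplexity, a smaller one sharpens it.

```python
import numpy as np

def perplexity(p_row):
    """Perp(P_i) = 2^H(P_i), with Shannon entropy H measured in bits."""
    p = p_row[p_row > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def cond_probs_row(neg_dist_row, sigma):
    """Conditional probabilities p_{j|i} from a row of negative squared distances."""
    logits = neg_dist_row / (2.0 * sigma ** 2)
    e = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return e / np.sum(e)

def find_sigma(neg_dist_row, target_perp, tol=1e-5, max_iter=100):
    """Bisect sigma_i until Perp(P_i) matches the target perplexity."""
    lo, hi = 1e-10, 1e4
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        perp = perplexity(cond_probs_row(neg_dist_row, sigma))
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:
            hi = sigma  # distribution too flat: shrink sigma
        else:
            lo = sigma  # distribution too peaked: grow sigma
    return sigma
```

A p_joint function would run this search for every row and then symmetrize, \(p_{ij} = (p_{j|i} + p_{i|j}) / 2n\).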

"Gradient masking" is a term introduced in Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples. If the model's output is "99.9% airplane, 0.1% cat", then a small change to the input gives a small change to the output, and the gradient tells us which changes will increase the probability of the "cat" class. The defense strategies that perform gradient masking typically result in a model that is very smooth in specific directions and neighborhoods of training points, which makes it harder for the adversary to find gradients indicating good candidate directions to perturb the input in a way that damages the model. Neither algorithm was explicitly designed to perform gradient masking, but gradient masking is apparently a defense that machine learning algorithms can invent relatively easily when they are trained to defend themselves and not given specific instructions about how to do so.
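To see what "the gradient tells us which changes increase the 'cat' class" means in code, here is a toy fast-gradient-sign perturbation of a logistic model in numpy. This is my own illustration of the attack idea, not code from the cited paper; the model and its weights are assumed for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_step(x, y, w, b, eps):
    """One fast-gradient-sign perturbation of input x against a logistic model.

    The loss is binary cross-entropy; its gradient with respect to the
    input works out to (p - y) * w, so we step eps in its sign direction.
    """
    p = sigmoid(np.dot(w, x) + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)  # move x in the most damaging direction
```

Gradient masking frustrates exactly this step: if the local gradient is flattened or hidden, `grad_x` no longer points toward a damaging perturbation, even though a perturbation found on a substitute model may still transfer.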

ONNX is the first step toward an open ecosystem where AI developers can easily move between state-of-the-art tools and choose the combination that is best for them. People experimenting with new models, and particularly those in research, want maximum flexibility and expressiveness in writing neural networks -- ranging from dynamic neural networks to supporting gradients of gradients, while keeping a bread-and-butter ConvNet performant. This is the first step in enabling us to rapidly move our latest research developments into production. We'll continue to evolve ONNX, PyTorch, and Caffe2 to make sure developers have the latest tools for AI, so expect more updates soon!

A similar situation arises in image classification, where manually engineered features (obtained by applying a number of filters) could be used in classification algorithms. I will compare the performance of typical machine learning algorithms which use engineered features with two deep learning methods (convolutional and recurrent neural networks) and show that deep learning can surpass the performance of the former. The rest of the implementation is pretty typical, and involves feeding the graph with batches of training data and evaluating the performance on a validation set. The rest is pretty standard for LSTM implementations, involving construction of layers (including dropout for regularization) and then an initial state.
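Feeding the graph with batches usually comes down to a small generator like the following (a generic sketch, not the post's actual training code; names are my own):

```python
import numpy as np

def batches(X, y, batch_size, shuffle=True, seed=0):
    """Yield (X_batch, y_batch) minibatches for feeding a training graph."""
    idx = np.arange(len(X))
    if shuffle:
        # Shuffle once per pass so batches differ between epochs
        np.random.default_rng(seed).shuffle(idx)
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]
```

Each epoch iterates the generator over the training split, then runs the same forward pass (without the update step) over the validation set to track generalization.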

Activation functions are functions which take an input signal and convert it to an output signal. Neural networks are universal function approximators, and deep neural networks are trained using backpropagation, which requires differentiable activation functions. Understanding activation functions is very important as they play a crucial role in the quality of deep neural networks. Conclusion: ReLU and its variants should be preferred over sigmoid or tanh activation functions.
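The preference for ReLU is easy to see from the derivatives backpropagation multiplies together. A quick numpy sketch of two of the activations discussed here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    # Derivative peaks at 0.25 and vanishes for large |z|: gradients shrink
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    # Derivative is exactly 1 for positive inputs: gradients pass through intact
    return (z > 0).astype(float)
```

Since the sigmoid's derivative never exceeds 0.25, stacking many sigmoid layers multiplies gradients by small factors and they vanish; ReLU's unit derivative on its active half avoids this, which is the basis of the conclusion above.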

MLPs (Multi-Layer Perceptrons) are great for many classification and regression tasks, but it is hard for MLPs to do classification and regression on sequences. I had a hard time understanding this model, but it turns out that it is not too hard to understand. If the gate value is high, then the output at the current step is influenced strongly by the current input, and only weakly by the previous state. In this tutorial, the model is capable of learning how to add two integer numbers (of any length).
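One plausible reading of that gated update, as a minimal numpy sketch (the original symbols were lost, so the names here are my own and the gate is passed in directly rather than computed from learned weights):

```python
import numpy as np

def gated_step(h_prev, x, z):
    """Blend the previous state and the current input with a gate z in [0, 1].

    z close to 1: the output is dominated by the current input x;
    z close to 0: the output is dominated by the previous state h_prev.
    """
    return z * x + (1.0 - z) * h_prev
```

In a trained network the gate itself is a learned function of the input and previous state, which lets the model decide, step by step, how much history to keep.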

The backpropagation method simply takes the gradient of the loss function with respect to the state of the next layer (\(\nabla_{\bf Y} L\)) and computes the gradients with respect to the current state (\(\nabla_{\bf A}L\)), weights (\(\nabla_{\bf W}L\)) and biases (\(\nabla_{\bf b}L\)). The backward method implements the gradient of the loss function with respect to the outputs of the network. The layer class first linearly transforms the current state vectors \({\bf A}\) and then feeds them into the activation layer to yield the input to the next layer \({\bf Y}\) via the forward method. In the backward method, the incoming gradient is first backpropagated through the logistic update, then through the linear update, to yield the gradients with respect to the current layer states, weights and biases.
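The layer described above can be sketched in a few lines of numpy (an illustrative reimplementation with my own names, assuming a logistic activation as in the text):

```python
import numpy as np

class LogisticLayer:
    """Linear transform A @ W + b followed by a logistic activation."""

    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, A):
        # Y = sigmoid(A @ W + b); cache A and Y for the backward pass
        self.A = A
        self.Y = 1.0 / (1.0 + np.exp(-(A @ self.W + self.b)))
        return self.Y

    def backward(self, dY):
        # First backprop through the logistic update, then the linear update
        dZ = dY * self.Y * (1.0 - self.Y)   # gradient w.r.t. pre-activation
        self.dW = self.A.T @ dZ             # gradient w.r.t. weights
        self.db = dZ.sum(axis=0)            # gradient w.r.t. biases
        return dZ @ self.W.T                # gradient w.r.t. current states A
```

Chaining `backward` calls from the output layer to the input layer is exactly the backpropagation pass: each layer consumes \(\nabla_{\bf Y} L\) and returns \(\nabla_{\bf A} L\) for the layer below it.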

In plain Stochastic Gradient Descent (SGD), the learning rate is not related to the shape of the error gradient: a single global learning rate is used, independent of the error gradient. It is therefore important to selectively decrease (or occasionally increase) the learning rate as training progresses in order to reach a good optimum. Plotting the cross-entropy loss can be more interpretable because of its log term: learning curves tend to follow a roughly exponential shape, which the log renders closer to linear. If the validation curve closely follows the training curve, the network has trained correctly.
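Two common ways to adjust a global learning rate as training progresses are step decay and smooth exponential decay. A minimal sketch (illustrative functions of my own, not tied to any particular framework's scheduler API):

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)

def exp_decay(lr0, epoch, k=0.1):
    """Smoothly decay the learning rate as lr0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)
```

Either schedule is queried once per epoch and the result fed to the SGD update, giving large early steps toward the optimum and small late steps for fine convergence.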