Gradient Boosting, Decision Trees and XGBoost with CUDA (Parallel Forall)

In this post I look at the popular gradient boosting algorithm XGBoost and show how to apply CUDA and parallel algorithms to greatly decrease training times in decision tree algorithms. XGBoost is a supervised learning algorithm: it takes a set of labelled training instances as input and builds a model that aims to correctly predict the label of each example based on the other information we know about it (known as the features of the instance). Figure 1 shows a simple decision tree model (I'll call it "Decision Tree 0") with two decision nodes and three leaves. XGBoost extends the loss function with penalty terms for adding new decision tree leaves to the model, with the penalty proportional to the size of the leaf weights.

In Raw Numpy: t-SNE

To ensure the perplexity of each row of $$P$$, $$Perp(P_i)$$, is equal to our desired perplexity, we simply perform a binary search over each $$\sigma_i$$ until $$Perp(P_i)$$ is approximately equal to our desired perplexity. The search function takes a matrix of negative Euclidean distances and a target perplexity. Let's also define a p_joint function that takes our data matrix $$\textbf{X}$$ and returns the matrix of joint probabilities $$P$$, estimating the required $$\sigma_i$$'s and the conditional probabilities matrix along the way. So we have our joint distributions $$p$$ and $$q$$. The only real difference is how we define the joint probability distribution matrix $$Q$$, which has entries $$q_{ij}$$.

Facebook and Microsoft introduce new open ecosystem for interchangeable AI frameworks

ONNX is the first step toward an open ecosystem where AI developers can easily move between state-of-the-art tools and choose the combination that is best for them. People experimenting with new models, and particularly those in research, want maximum flexibility and expressiveness in writing neural networks -- ranging from dynamic neural networks to supporting gradients of gradients, while keeping a bread-and-butter ConvNet performant. This is the first step in enabling us to rapidly move our latest research developments into production. We'll continue to evolve ONNX, PyTorch, Caffe2 to make sure developers have the latest tools for AI, so expect more updates soon!

Time series classification with Tensorflow

A similar situation arises in image classification, where manually engineered features (obtained by applying a number of filters) could be used in classification algorithms. I will compare the performance of typical machine learning algorithms which use engineered features with two deep learning methods (convolutional and recurrent neural networks) and show that deep learning can surpass the performance of the former. The rest of the implementation is pretty typical, and involves feeding the graph with batches of training data and evaluating the performance on a validation set. The rest is pretty standard for LSTM implementations, involving construction of layers (including dropout for regularization) and then an initial state.
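Feeding the graph with minibatches typically looks something like the sketch below: a generator (here a hypothetical `batches` helper, not taken from the post) that shuffles the training set once per epoch and yields fixed-size slices.

```python
import numpy as np

def batches(X, y, batch_size, shuffle=True, seed=0):
    """Yield (X_batch, y_batch) minibatches for feeding a training graph."""
    idx = np.arange(len(X))
    if shuffle:
        np.random.default_rng(seed).shuffle(idx)
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]
```

Each yielded pair would be bound to the graph's placeholders via a feed dict; the validation set is evaluated the same way but without shuffling or gradient updates.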

Secret Sauce behind the beauty of Deep Learning: Beginners guide to Activation Functions

Activation functions are functions which take an input signal and convert it to an output signal. Neural networks are universal function approximators, and deep neural networks are trained using backpropagation, which requires differentiable activation functions. Understanding activation functions is very important as they play a crucial role in the quality of deep neural networks. Conclusion: ReLU and its variants should be preferred over sigmoid or tanh activation functions.
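A minimal sketch of the common activations and their derivatives makes the conclusion visible: the sigmoid and tanh gradients vanish for large $$|x|$$, while the ReLU gradient stays at 1 for any positive input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

# Derivatives used by backpropagation. Note how sigmoid_grad and tanh_grad
# shrink toward 0 for large |x| (vanishing gradient), while relu_grad is
# exactly 1 on the positive side.
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    return (np.asarray(x) > 0).astype(float)
```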

GRU implementation in TensorFlow

MLPs (Multi-Layer Perceptrons) are great for many classification and regression tasks, but it is hard for MLPs to do classification and regression on sequences. I had a hard time understanding this model at first, but it turns out not to be too hard to understand. If the update gate $$z$$ is high, then the output at the current step is influenced a lot by the current input ($$x_t$$), but it is not influenced a lot by the previous state ($$h_{t-1}$$). In this tutorial, the model is capable of learning how to add two integer numbers (of any length).
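The gate behaviour can be sketched with a single GRU step in NumPy. This follows the Cho et al. convention $$h_t = (1 - z_t)\,h_{t-1} + z_t\,\tilde{h}_t$$ (biases omitted for brevity, and the parameter names are my own, not the post's); with $$z_t$$ near 1 the new state is dominated by the input-driven candidate $$\tilde{h}_t$$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU step: h = (1 - z) * h_prev + z * h_tilde."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h_prev @ Uz)               # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde
```

Because the new state is a convex combination of the previous state and a tanh-bounded candidate, the hidden state stays in $$(-1, 1)$$ when initialized at zero.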

Yet another introduction to Neural Networks

The backpropagation method simply takes the gradient of the loss function with respect to the state of the next layer ($$\nabla_{\bf Y} L$$) and computes the gradients with respect to the current state ($$\nabla_{\bf A}L$$), weights ($$\nabla_{\bf W}L$$) and biases ($$\nabla_{\bf b}L$$). The backward method implements the gradient of the loss function with respect to the outputs of the network. The layer class first linearly transforms the current state vectors $${\bf A}$$ and then feeds them into the activation layer to yield the input to the next layer $${\bf Y}$$ via the forward method. In the backward method, the incoming gradient is first backpropagated through the logistic update, then through the linear update to yield the gradients with respect to the current layer states, weights and biases.
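The forward/backward pattern described above can be sketched as a single logistic layer (the class and attribute names here are illustrative, not the post's): forward caches $${\bf A}$$ and $${\bf Y}$$, and backward chains the incoming gradient through the logistic update, then the linear update.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LogisticLayer:
    """Linear transform followed by a logistic activation."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_in, n_out)) * 0.1
        self.b = np.zeros(n_out)

    def forward(self, A):
        self.A = A                       # cache current state for backward
        self.Y = sigmoid(A @ self.W + self.b)
        return self.Y

    def backward(self, dY):
        # Through the logistic update: dL/dZ = dL/dY * Y * (1 - Y)
        dZ = dY * self.Y * (1.0 - self.Y)
        # Through the linear update: gradients for W, b, and current state A
        self.dW = self.A.T @ dZ
        self.db = dZ.sum(axis=0)
        return dZ @ self.W.T
```

A finite-difference check on one weight is a quick way to confirm the chain of gradients is consistent.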

Contouring learning rate to optimize neural nets

In plain Stochastic Gradient Descent (SGD), the learning rate is not related to the shape of the error gradient because a global learning rate is used, which is independent of the error gradient. It is imperative to selectively increase or decrease the learning rate as training progresses in order to reach the global optimum or the desired destination. Plotting the cross-entropy loss can be more interpretable than plotting accuracy because its log term compensates for the roughly exponential shape of the learning process. If the validation curve closely follows the training curve, the network has trained correctly.
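Two common ways to contour the global SGD learning rate over training are step decay and smooth exponential decay; a minimal sketch (the function names are illustrative):

```python
import numpy as np

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)

def exp_decay(lr0, epoch, k=0.1):
    """Smoothly decay the learning rate: lr0 * exp(-k * epoch)."""
    return lr0 * np.exp(-k * epoch)
```

Step decay gives a piecewise-constant schedule (large steps early, fine steps late), while exponential decay shrinks the rate a little every epoch.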