
### Loss and Loss Functions for Training Deep Learning Neural Networks

Neural networks are trained using stochastic gradient descent, which requires that you choose a loss function when designing and configuring your model. There are many loss functions to choose from, and it can be challenging to know which one to choose, or even what a loss function is and what role it plays when training a neural network. In this post, you will discover the role of loss and loss functions in training deep learning neural networks and how to choose the right loss function for your predictive modeling problems. A deep learning neural network learns to map a set of inputs to a set of outputs from training data.

### Connections: Log Likelihood, Cross Entropy, KL Divergence, Logistic Regression, and Neural Networks

Maximizing the (log) likelihood is equivalent to minimizing the binary cross entropy. For a Bernoulli model, the negative log likelihood is the binary cross entropy (up to averaging over the examples), so the two objective functions have the same minimizer and there can be no difference between the resulting models or their characteristics. This can, of course, be extended quite simply to the multiclass case using softmax cross-entropy and the so-called multinoulli (categorical) likelihood, so there is also no difference in the multiclass setting that is typical in, say, neural networks. The difference between MLE and cross-entropy is one of scope: MLE is a structured and principled approach to modeling and training, while binary and softmax cross-entropy are special cases of it applied to problems that people typically care about. After that aside on maximum likelihood estimation, let's delve more into the relationship between negative log likelihood and cross entropy.
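The equivalence can be checked numerically. The following is a minimal sketch, assuming NumPy and a Bernoulli likelihood over binary labels; the function names, the labels, and the two sets of predicted probabilities are made up for illustration and do not come from any particular library.

```python
import numpy as np

def bernoulli_log_likelihood(y, p):
    """Log likelihood of binary labels y under predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    # Log of the product of Bernoulli terms p^y * (1 - p)^(1 - y).
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def binary_cross_entropy(y, p):
    """Mean binary cross-entropy between labels y and probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Hypothetical labels and predictions from two candidate models.
y_true = np.array([1, 0, 1, 1, 0])
p_a = np.array([0.9, 0.2, 0.7, 0.6, 0.1])
p_b = np.array([0.6, 0.4, 0.5, 0.5, 0.3])

n = len(y_true)
ll_a = bernoulli_log_likelihood(y_true, p_a)
ll_b = bernoulli_log_likelihood(y_true, p_b)
bce_a = binary_cross_entropy(y_true, p_a)
bce_b = binary_cross_entropy(y_true, p_b)

# The mean negative log likelihood and the cross-entropy are the same number.
print("model A:", -ll_a / n, bce_a)
print("model B:", -ll_b / n, bce_b)
```

Because the two quantities differ only by sign and the constant factor `n`, whichever model has the higher likelihood also has the lower cross-entropy.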

### A Gentle Introduction to Cross-Entropy for Machine Learning

Cross-entropy is commonly used in machine learning as a loss function. It is a measure from the field of information theory that builds upon entropy and generally calculates the difference between two probability distributions. It is closely related to, but different from, KL divergence: KL divergence calculates the relative entropy between two probability distributions, whereas cross-entropy can be thought of as calculating the total entropy between the distributions. Cross-entropy is also related to, and often confused with, logistic loss, also called log loss. Although the two measures are derived from different sources, when used as loss functions for classification models, both measures calculate the same quantity and can be used interchangeably.
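The relationship between these quantities can be made concrete with a small calculation. The sketch below assumes NumPy, base-2 logarithms (so results are in bits), and two made-up discrete distributions `p` and `q` with no zero entries; the helper names are illustrative only. It shows that cross-entropy equals the entropy of `p` plus the KL divergence from `p` to `q`, and that with a one-hot target the cross-entropy reduces to the negative log of the probability assigned to the true class, which is the log loss for that example (log loss is usually stated with the natural log, which changes only the units).

```python
import numpy as np

def entropy(p):
    """Entropy H(P) in bits; assumes every entry of p is positive."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits; assumes every entry of q is positive."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """KL divergence KL(P || Q) in bits; assumes positive entries in p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log2(p / q))

# Hypothetical distributions over three events.
p = np.array([0.10, 0.40, 0.50])
q = np.array([0.80, 0.15, 0.05])

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)
print("H(P) + KL(P || Q):", h_p + kl_pq)
print("H(P, Q):          ", h_pq)

# With a one-hot target, only the true class contributes to the sum.
one_hot = np.array([0.0, 1.0, 0.0])
per_example_loss = cross_entropy(one_hot, q)
print("one-hot cross-entropy:", per_example_loss)
```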

### Softmax Classifiers Explained

Last week, we discussed multi-class SVM loss; specifically, the hinge loss and squared hinge loss functions. In reality, these values would not be randomly generated; they would instead be the output of your scoring function f. Let's exponentiate the output of the scoring function, yielding our unnormalized probabilities:

Figure 2: Exponentiating the output values from the scoring function gives us our unnormalized probabilities.

Figure 4: Taking the negative log of the probability for the correct ground-truth class yields the final loss for the data point.

To examine some actual probabilities, let's loop over a few randomly sampled training examples and examine the output probabilities returned by the classifier.

Note: I'm randomly sampling from the training data rather than the testing data to demonstrate that there should be a noticeably large gap between the probabilities for each class label.
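The steps described in the figure captions (exponentiate the scores, normalize them, then take the negative log of the probability for the correct class) can be sketched as follows. This is an illustrative NumPy version and not the code from the original post: the function name and the score values are hypothetical, and subtracting the maximum score before exponentiating is an added numerical-stability step that does not change the resulting probabilities.

```python
import numpy as np

def softmax_cross_entropy(scores, correct_class):
    """Return (probabilities, loss) for one data point's class scores."""
    scores = np.asarray(scores, dtype=float)
    # Shift by the max score for numerical stability (an added assumption);
    # the normalized probabilities are unchanged by this shift.
    shifted = scores - np.max(scores)
    # Exponentiate to get unnormalized probabilities (scaled by a constant).
    unnormalized = np.exp(shifted)
    # Normalize so the probabilities sum to one.
    probs = unnormalized / np.sum(unnormalized)
    # Negative log of the probability assigned to the ground-truth class.
    loss = -np.log(probs[correct_class])
    return probs, loss

# Hypothetical scoring-function outputs for three classes.
scores = np.array([-3.44, 1.16, 3.91])
probs, loss = softmax_cross_entropy(scores, correct_class=2)

print("probabilities:", probs)
print("loss:", loss)
```

The loss approaches zero as the probability for the correct class approaches one, and grows without bound as that probability approaches zero.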