In a neural network, each neuron has weights, a bias, and an activation function. The inputs are multiplied by the weights, the bias is added, and the activation function is applied to the result before it is passed to the next layer. The output layer finally produces the predicted value (yhat). The prediction rarely matches the actual value (y) exactly, and we call the difference the error. We therefore define a loss (or cost) function to measure these errors and minimize it, using backpropagation to compute the gradients.
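As a minimal sketch of the forward pass described above, the following pure-Python example pushes one input through a tiny two-layer network and compares the prediction with the target. The layer sizes, weights, biases, and the helper names `dense` and `forward` are made up for illustration; they do not come from any particular library.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One layer: activation(W x + b), with W given as a list of rows."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Feed x through every layer in order and return the final output."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Hypothetical 2-2-1 network: the numbers are arbitrary, not trained values.
layers = [
    ([[0.5, -0.25], [0.1, 0.3]], [0.0, 0.1], math.tanh),  # hidden layer
    ([[1.0, -1.0]], [0.2], sigmoid),                      # output layer
]

x = [1.0, 2.0]   # input features
y = 1.0          # actual value

yhat = forward(x, layers)[0]  # predicted value
error = y - yhat              # what a loss function would summarize
print(f"yhat = {yhat:.4f}, error = {error:.4f}")
```

A loss function turns errors like this one, collected over many examples, into a single number that training tries to reduce.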

Practically oriented textbooks and online courses that introduce machine learning focus on two major loss functions, the squared error for regression tasks and cross entropy for classification tasks, usually with no justification for why these two are important. Before we dive into why we might be interested in these loss functions, let's make sure that we're on the same page and quickly recall how they are defined. To explain why these two losses achieve what we want, we first need to agree on what exactly it is that we want to achieve. Let's consider a running regression example. In this case we're trying to estimate the value of a variable, which could for instance be the number of active Twitter users worldwide in a given quarter. We assume here that there is a true answer, meaning that there is a distribution that accurately models the number of Twitter users over time.
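For reference, the two losses named above are commonly written as follows. The notation is an assumption made here for illustration: N is the number of examples, K the number of classes, y_i the actual value, yhat_i the prediction, y_{ik} a one-hot label, and yhat_{ik} the predicted probability of class k for example i.

```latex
% Mean squared error for regression
L_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2

% Cross entropy for K-class classification
L_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}
```

Some texts sum over examples instead of averaging, or include a factor of 1/2 in the squared error. These variants differ only by a constant factor, so they have the same minimizer.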

Machine learning has attracted interest from various fields as a powerful tool for finding patterns in data. Supported by machine learning technology, computer programs can improve automatically through experience, which has enabled a wide spectrum of applications, from visual and speech recognition and effective web search to the study of human genomics [1, 2]. Classical machine learning techniques have also found many interesting applications in different disciplines of quantum physics [3, 4, 5, 6, 7, 8, 9, 10]. With the advancement of quantum information science and technology, there is both theoretical and practical interest in understanding quantum systems, building quantum devices, developing quantum algorithms, and ultimately, taking advantage of quantum supremacy [11, 12].

Maximizing the (log) likelihood is equivalent to minimizing the binary cross entropy. The two objective functions differ only by a sign and, when the cross entropy is averaged over examples, a constant factor, so they share the same optimum and there can be no difference between the resulting models or their characteristics. This can, of course, be extended quite simply to the multiclass case using softmax cross entropy and the so-called multinoulli likelihood, so the same holds for multiclass problems, as is typical in, say, neural networks. The difference between MLE and cross entropy is that MLE is a structured and principled approach to modeling and training, while binary and softmax cross entropy are simply special cases of it applied to problems that people typically care about. After that aside on maximum likelihood estimation, let's delve further into the relationship between negative log likelihood and cross entropy.
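The equivalence can be checked numerically. The sketch below, which uses made-up labels and predicted probabilities, computes the Bernoulli log likelihood and the mean binary cross entropy for the same predictions and confirms that one is the other multiplied by -N. The function names are illustrative and are not taken from any library.

```python
import math


def bernoulli_log_likelihood(ys, ps):
    """Sum of log P(y_i | p_i) under a Bernoulli model with P(y=1) = p_i."""
    return sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(ys, ps))


def binary_cross_entropy(ys, ps):
    """Mean binary cross entropy between labels ys and probabilities ps."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p) for y, p in zip(ys, ps)
    )
    return -total / len(ys)


# Illustrative data: five binary labels and two sets of predicted probabilities.
ys = [1, 0, 1, 1, 0]
ps_a = [0.9, 0.2, 0.6, 0.7, 0.4]
ps_b = [0.6, 0.5, 0.5, 0.6, 0.5]

for name, ps in (("model A", ps_a), ("model B", ps_b)):
    ll = bernoulli_log_likelihood(ys, ps)
    bce = binary_cross_entropy(ys, ps)
    # The log likelihood equals -N times the mean cross entropy.
    print(f"{name}: log likelihood = {ll:.4f}, -N * BCE = {-len(ys) * bce:.4f}")
```

Because the two quantities are related by a negative constant, whichever set of predictions has the higher likelihood also has the lower cross entropy, so both objectives rank models identically.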