Statistical Effect Size and Python Implementation - Analytics Vidhya
Then, we calculate the ratio of the weighted sum of squared differences between each category's average and the overall average to the sum of squared differences between each value and the overall average. Eta ranges between 0 and 1. A value closer to 0 indicates that all categories have similar values, so no single category has more influence on variable y. A value closer to 1 indicates that one or more categories have values that differ from the other categories and have more influence on variable y. Eta can be used in EDA and data processing to identify which categorical features are more important when building a machine learning model.
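The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own implementation: the function name `correlation_ratio` and the sample data are hypothetical, and taking the square root of the ratio to obtain eta (rather than reporting the ratio itself as eta squared) is an assumption.

```python
import math


def correlation_ratio(categories, values):
    """Return (eta_squared, eta) for a categorical feature and a numeric target."""
    overall_mean = sum(values) / len(values)

    # Group the numeric values by category.
    groups = {}
    for category, value in zip(categories, values):
        groups.setdefault(category, []).append(value)

    # Weighted sum of squared differences between category means and the overall mean.
    between = sum(
        len(group) * (sum(group) / len(group) - overall_mean) ** 2
        for group in groups.values()
    )
    # Sum of squared differences between each value and the overall mean.
    total = sum((value - overall_mean) ** 2 for value in values)

    if total == 0:
        # All values are identical, so no category can explain any variation.
        return 0.0, 0.0
    eta_squared = between / total
    return eta_squared, math.sqrt(eta_squared)


# Hypothetical data: category fully determines y in the first case, not at all in the second.
separated = correlation_ratio(["a", "a", "b", "b"], [1, 1, 3, 3])
mixed = correlation_ratio(["a", "a", "b", "b"], [1, 3, 1, 3])
```

In the first call the category means differ and every value equals its category mean, so eta is at the top of its range; in the second call both category means equal the overall mean, so eta is at the bottom.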
Bellman equation
A Bellman equation, named after Richard E. Bellman, is a necessary condition for optimality associated with the mathematical optimization method known as dynamic programming.[1] It writes the "value" of a decision problem at a certain point in time in terms of the payoff from some initial choices and the "value" of the remaining decision problem that results from those initial choices. The Bellman equation was first applied to engineering control theory and to other topics in applied mathematics, and subsequently became an important tool in economic theory, though the basic concepts of dynamic programming are prefigured in John von Neumann and Oskar Morgenstern's Theory of Games and Economic Behavior and Abraham Wald's sequential analysis. In continuous-time optimization problems, the analogous equation is a partial differential equation that is called the Hamilton–Jacobi–Bellman equation.[4][5] In discrete time, any multi-stage optimization problem can be solved by analyzing the appropriate Bellman equation.
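To make the recursive structure concrete, here is a minimal sketch of value iteration on a tiny deterministic decision problem. The states, actions, rewards, and the discount factor `gamma` are all hypothetical and chosen only for illustration; the update applied is V(s) = max over actions of [reward + gamma * V(next state)], which is one common discrete-time form of the Bellman equation.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Repeatedly apply the Bellman update until the values stop changing.

    transitions[state][action] = (reward, next_state), all deterministic.
    """
    values = {state: 0.0 for state in transitions}
    for _ in range(max_iters):
        new_values = {}
        for state, actions in transitions.items():
            # Value now = best (immediate payoff + discounted value of what remains).
            new_values[state] = max(
                reward + gamma * values[next_state]
                for reward, next_state in actions.values()
            )
        change = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if change < tol:
            break
    return values


# Hypothetical three-state problem: "C" is an absorbing state with no further reward.
toy_problem = {
    "A": {"stay": (0.5, "A"), "go": (0.0, "B")},
    "B": {"stay": (0.0, "B"), "go": (10.0, "C")},
    "C": {"stay": (0.0, "C")},
}
toy_values = value_iteration(toy_problem)
```

In this toy problem, staying in "A" forever is worth 0.5 / (1 - 0.9) = 5, while moving on to "B" and collecting the reward of 10 one step later is worth 0.9 * 10 = 9, so the value of "A" reflects the second plan.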
Backpropagation
In machine learning, backpropagation (backprop,[1] BP) is a widely used algorithm for training feedforward neural networks. Generalizations of backpropagation exist for other artificial neural networks (ANNs), and for functions generally. These classes of algorithms are all referred to generically as "backpropagation".[2] In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input–output example, and does so efficiently, unlike a naive direct computation of the gradient with respect to each weight individually. This efficiency makes it feasible to use gradient methods for training multilayer networks, updating weights to minimize loss; gradient descent, or variants such as stochastic gradient descent, are commonly used. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this is an example of dynamic programming.[3]
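The layer-by-layer use of the chain rule can be illustrated with a very small network. The sketch below is only an illustration under stated assumptions: one hidden layer with tanh units, no bias terms, a squared-error loss, and hand-picked weights are all choices made here, not details from the text. A finite-difference helper is included so the analytic gradient can be compared against a naive per-weight computation.

```python
import math


def forward(W1, w2, x):
    """Hidden activations h = tanh(W1 x) and scalar output y_hat = w2 . h."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y_hat = sum(w * hj for w, hj in zip(w2, h))
    return h, y_hat


def loss(W1, w2, x, y):
    _, y_hat = forward(W1, w2, x)
    return 0.5 * (y_hat - y) ** 2


def backprop(W1, w2, x, y):
    """Gradients of the loss for one example, computed from the output layer backward."""
    h, y_hat = forward(W1, w2, x)
    d_y_hat = y_hat - y                      # dL/dy_hat
    grad_w2 = [d_y_hat * hj for hj in h]     # output-layer weights
    # Reuse d_y_hat for the hidden layer: dL/dz_j = dL/dy_hat * w2_j * (1 - h_j^2).
    d_z = [d_y_hat * w * (1.0 - hj ** 2) for w, hj in zip(w2, h)]
    grad_W1 = [[dz * xi for xi in x] for dz in d_z]
    return grad_W1, grad_w2


def numeric_grad_W1(W1, w2, x, y, j, i, eps=1e-6):
    """Naive central-difference estimate for a single hidden-layer weight."""
    up = [row[:] for row in W1]
    down = [row[:] for row in W1]
    up[j][i] += eps
    down[j][i] -= eps
    return (loss(up, w2, x, y) - loss(down, w2, x, y)) / (2 * eps)


# Hypothetical weights and a single input-output example.
W1 = [[0.5, -0.3], [0.8, 0.2]]
w2 = [1.0, -1.5]
x, y = [1.0, 2.0], 0.5
grad_W1, grad_w2 = backprop(W1, w2, x, y)
```

The finite-difference helper needs two forward passes per weight, whereas `backprop` obtains every gradient from one forward and one backward pass by reusing the intermediate term `d_y_hat`.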
Simulated annealing - Wikipedia
Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. It is often used when the search space is discrete (e.g., the traveling salesman problem). For problems where finding an approximate global optimum in a fixed amount of time is more important than finding a precise local optimum, simulated annealing may be preferable to alternatives such as gradient descent or branch and bound. The name of the algorithm comes from annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects.
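A minimal sketch of the idea on a discrete search space follows. Everything specific here is an assumption made for illustration: the geometric cooling schedule, the parameter defaults, the bumpy one-dimensional cost function, and the neighbor rule that steps one integer left or right. Worse candidates are accepted with probability exp(-delta / temperature), which is the usual Metropolis-style acceptance rule.

```python
import math
import random


def simulated_annealing(cost, neighbor, start, t_start=10.0, t_end=1e-3,
                        cooling=0.95, steps_per_temp=50, seed=0):
    """Return the best state found and its cost."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    temperature = t_start
    while temperature > t_end:
        for _ in range(steps_per_temp):
            candidate = neighbor(current, rng)
            delta = cost(candidate) - current_cost
            # Always accept improvements; accept worse moves with a probability
            # that shrinks as the temperature falls.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temperature *= cooling  # controlled cooling
    return best, best_cost


def bumpy_cost(x):
    """Hypothetical cost with several local minima over the integers."""
    return 0.1 * x * x + 10.0 * math.sin(x)


def step_neighbor(x, rng):
    """Move one step left or right, staying inside [-20, 20]."""
    return max(-20, min(20, x + rng.choice([-1, 1])))


start_state = 18
best_state, best_cost = simulated_annealing(bumpy_cost, step_neighbor, start_state)
```

At high temperature the search moves almost freely across local minima; as the temperature drops it behaves more and more like a greedy local search around the best region found so far.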
Ant colony optimization algorithms - Wikipedia
In computer science and operations research, the ant colony optimization algorithm (ACO) is a probabilistic technique for solving computational problems which can be reduced to finding good paths through graphs. Artificial Ants stand for multi-agent methods inspired by the behavior of real ants. The pheromone-based communication of biological ants is often the predominant paradigm used.[2] Combinations of Artificial Ants and local search algorithms have become a method of choice for numerous optimization tasks involving some sort of graph, e.g., vehicle routing and internet routing. The burgeoning activity in this field has led to conferences dedicated solely to Artificial Ants, and to numerous commercial applications by specialized companies such as AntOptima. As an example, Ant colony optimization[3] is a class of optimization algorithms modeled on the actions of an ant colony. Real ants lay down pheromones directing each other to resources while exploring their environment. The simulated 'ants' similarly record their positions and the quality of their solutions, so that in later simulation iterations more ants locate better solutions.[4]
Confusion Matrix and it's 25 offspring: or the link between machine learning and epidemiology Dr. Yury Zablotski
For instance, an LR of 3 suggests that for every false positive, there are 3 true positives. The greater the value of the LR for a particular test, the more likely a positive test result is a true positive. On the other hand, an LR of less than 1 would imply that an individual with a positive test result is more likely to be non-diseased than diseased. The rationale for the diagnostic odds ratio (DOR) is that it is a single indicator of test performance (like accuracy and Youden's J index, explained below) which is independent of prevalence (unlike accuracy) and is presented as an odds ratio, which is familiar to epidemiologists. Like a usual odds ratio, the diagnostic odds ratio ranges from zero to infinity: a DOR greater than one is already good, and the higher the DOR, the better the test performs. A DOR of less than one indicates that the test performs badly, or even gives wrong information.
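These quantities follow directly from the four cells of a confusion matrix. The sketch below assumes the standard definitions (positive LR = sensitivity / (1 - specificity), negative LR = (1 - sensitivity) / specificity, DOR = positive LR / negative LR); the function name and the counts are hypothetical.

```python
def diagnostic_ratios(tp, fp, fn, tn):
    """Likelihood ratios and diagnostic odds ratio from confusion-matrix counts.

    No guard against empty cells is included: a zero count in fp or fn
    would cause a division by zero.
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    # Equivalent to lr_positive / lr_negative, written with counts to avoid rounding.
    dor = (tp * tn) / (fp * fn)
    return lr_positive, lr_negative, dor


# Hypothetical test results: 100 diseased and 100 non-diseased individuals.
lr_pos, lr_neg, dor = diagnostic_ratios(tp=90, fp=20, fn=10, tn=80)
```

With these counts the sensitivity is 0.9 and the specificity is 0.8, so the positive LR is about 4.5, the negative LR about 0.125, and the DOR is 36, well above one.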
Pattern recognition - Wikipedia
Pattern recognition is the automated recognition of patterns and regularities in data. Pattern recognition is closely related to artificial intelligence and machine learning,[1] together with applications such as data mining and knowledge discovery in databases (KDD), and is often used interchangeably with these terms. However, these are distinguished: machine learning is one approach to pattern recognition, while other approaches include hand-crafted (not learned) rules or heuristics; and pattern recognition is one approach to artificial intelligence, while other approaches include symbolic artificial intelligence.[2] The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories.[3] This article focuses on machine learning approaches to pattern recognition.
Autoencoder - Wikipedia
An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner.[1] The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal "noise". Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. Several variants exist to the basic model, with the aim of forcing the learned representations of the input to assume useful properties.[2] Examples are the regularized autoencoders (Sparse, Denoising and Contractive autoencoders), proven effective in learning representations for subsequent classification tasks,[3] and Variational autoencoders, with their recent applications as generative models.[4] Autoencoders are effectively used for solving many applied problems, from face recognition[5] to acquiring the semantic meaning of words.[6][7]
Bias–variance tradeoff - Wikipedia
Suppose that we have a training set consisting of a set of points $x_1, \dots, x_n$ and real values $y_i$ associated with each point $x_i$. We want to find a function $\hat{f}(x)$ that approximates the true function $f(x)$ as well as possible, by means of some learning algorithm. We make "as well as possible" precise by measuring the mean squared error between $y$ and $\hat{f}(x)$: we want $(y - \hat{f}(x))^2$ to be minimal, both for $x_1, \dots, x_n$ and for points outside of our sample. Of course, we cannot hope to do so perfectly, since the $y_i$ contain noise $\varepsilon$; this means we must be prepared to accept an irreducible error in any function we come up with. Finding an $\hat{f}$ that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning.
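The setup can be explored numerically with a small simulation. The sketch below is an illustration under assumptions that are not in the text: the true function is taken to be sin(x), the noise is Gaussian, and the learning algorithm is a k-nearest-neighbor average. It redraws the training set many times and measures, at one query point, the squared bias and the variance of the prediction. The error is measured against the true function rather than against noisy observations, so the irreducible noise term is left out.

```python
import math
import random


def true_f(x):
    """Assumed true function; the learner never sees it directly."""
    return math.sin(x)


def knn_predict(train, x0, k):
    """Average the y values of the k training points closest to x0."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def bias_variance_at(x0, k, n_train=30, n_repeats=500, noise_sd=0.3, seed=0):
    """Estimate squared bias, variance, and error against true_f at x0."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_repeats):
        # Draw a fresh noisy training set: y_i = f(x_i) + epsilon.
        train = []
        for _ in range(n_train):
            x = rng.uniform(0.0, math.pi)
            train.append((x, true_f(x) + rng.gauss(0.0, noise_sd)))
        predictions.append(knn_predict(train, x0, k))

    mean_prediction = sum(predictions) / n_repeats
    bias_sq = (mean_prediction - true_f(x0)) ** 2
    variance = sum((p - mean_prediction) ** 2 for p in predictions) / n_repeats
    error = sum((p - true_f(x0)) ** 2 for p in predictions) / n_repeats
    return bias_sq, variance, error


flexible = bias_variance_at(x0=math.pi / 2, k=1)    # follows the noise closely
smooth = bias_variance_at(x0=math.pi / 2, k=15)     # averages over many points
```

For each setting, the error against the true function equals the squared bias plus the variance. A small k tracks individual noisy points and therefore varies more from one training set to the next, while a large k averages the noise away at the cost of smoothing over the shape of the true function.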