Sharpness-Aware Minimization in Genetic Programming
Bakurov, Illya, Haut, Nathan, Banzhaf, Wolfgang
The automatic discovery of mathematical expressions that describe phenomena captured in data is a valuable tool for accelerating scientific discovery: the expressions can be used to make predictions about the system that generated the data, and they can be studied directly to extract new insights into that system. There are many approaches for finding equations that fit data: linear regression, polynomial regression, SINDy [7], neural-symbolic regression [6], symbolic regression [19], etc. Genetic programming (GP) is a popular method for finding equations that fit data because it offers greater flexibility for discovering non-linear behaviors while also being effective in small data scenarios, unlike deep learning (DL) approaches, which generally require large training data sets. This ability of GP to be effective in small data scenarios is likely due in part to evolution's bias toward simple solutions, which are naturally less likely to overfit [5].

Even so, in small data scenarios, models are underconstrained in the interstitial spaces between the training data points, so surprising and unexpected behavior can occur when interpolating. Ideally, models should be at least stable (smooth) when interpolating; otherwise, trust in them can be severely diminished. Several GP methods have been proposed to constrain model behavior in these interstitial spaces and thereby improve robustness against overfitting in small data scenarios, such as order of non-linearity [33], model curvature [30], the random sampling technique (RST) [14], RelaxGP [8], and overfit repulsors [31]. Order of non-linearity and model curvature use properties of a model to predict whether it is overfitting [30, 33]. Random sampling attempts to reduce the risk of overfitting by ensuring that no model sees the whole data set in a single generation [14].
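To make the random sampling idea concrete, the following is a minimal Python sketch, not the implementation from [14]. It assumes that models are plain callables, that fitness is the root-mean-square error, and that a single fresh subset is drawn once per generation and shared by the whole population. The function names (`sample_generation_subset`, `subset_rmse`, `evaluate_population`) and the `subset_size` parameter are hypothetical and chosen only for illustration.

```python
import random


def sample_generation_subset(n_cases, subset_size, rng):
    """Draw the indices of the training cases visible in one generation.

    Assumption: the subset is sampled without replacement and is strictly
    smaller than the data set, so no model sees all cases at once.
    """
    if not 0 < subset_size < n_cases:
        raise ValueError("subset_size must be in the range 1..n_cases-1")
    return sorted(rng.sample(range(n_cases), subset_size))


def subset_rmse(model, X, y, indices):
    """Root-mean-square error of `model` on the selected cases only."""
    squared_error = sum((model(X[i]) - y[i]) ** 2 for i in indices)
    return (squared_error / len(indices)) ** 0.5


def evaluate_population(population, X, y, subset_size, rng):
    """Score every model on the same freshly sampled subset.

    Calling this once per generation gives each generation a different
    partial view of the training data.
    """
    indices = sample_generation_subset(len(X), subset_size, rng)
    fitness = [subset_rmse(model, X, y, indices) for model in population]
    return fitness, indices
```

In an evolutionary loop, `evaluate_population` would be called at the start of each generation with the same `rng`, so selection pressure in any one generation reflects only part of the data. How the subset size is chosen, and whether it changes over the run, is left open here.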