to

### A Leisurely Look at Versions and Variants of the Cross Validation Estimator

Many versions of cross-validation (CV) exist in the literature; and each version though has different variants. All are used interchangeably by many practitioners; yet, without explanation to the connection or difference among them. This article has three contributions. First, it starts by mathematical formalization of these different versions and variants that estimate the error rate and the Area Under the ROC Curve (AUC) of a classification rule, to show the connection and difference among them. Second, we prove some of their properties and prove that many variants are either redundant or "not smooth". Hence, we suggest to abandon all redundant versions and variants and only keep the leave-one-out, the $K$-fold, and the repeated $K$-fold. We show that the latter is the only among the three versions that is "smooth" and hence looks mathematically like estimating the mean performance of the classification rules. However, empirically, for the known phenomenon of "weak correlation", which we explain mathematically and experimentally, it estimates both conditional and mean performance almost with the same accuracy. Third, we conclude the article with suggesting two research points that may answer the remaining question of whether we can come up with a finalist among the three estimators: (1) a comparative study, that is much more comprehensive than those available in literature and conclude no overall winner, is needed to consider a wide range of distributions, datasets, and classifiers including complex ones obtained via the recent deep learning approach. (2) we sketch the path of deriving a rigorous method for estimating the variance of the only "smooth" version, repeated $K$-fold CV, rather than those ad-hoc methods available in the literature that ignore the covariance structure among the folds of CV.

### Cross validation in sparse linear regression with piecewise continuous nonconvex penalties and its acceleration

We investigate the signal reconstruction performance of sparse linear regression in the presence of noise when piecewise continuous nonconvex penalties are used. Among such penalties, we focus on the smoothly clipped absolute deviation (SCAD) penalty. The contributions of this study are three-fold: We first present a theoretical analysis of a typical reconstruction performance, using the replica method, under the assumption that each component of the design matrix is given as an independent and identically distributed (i.i.d.) Gaussian variable. This clarifies the superiority of the SCAD estimator compared with $\ell_1$ in a wide parameter range, although the nonconvex nature of the penalty tends to lead to solution multiplicity in certain regions. This multiplicity is shown to be connected to replica symmetry breaking in the spin-glass theory, and associated phase diagrams are given. We also show that the global minimum of the mean square error between the estimator and the true signal is located in the replica symmetric phase. Second, we develop an approximate formula efficiently computing the cross-validation error without actually conducting the cross-validation, which is also applicable to the non-i.i.d. design matrices. It is shown that this formula is only applicable to the unique solution region and tends to be unstable in the multiple solution region. We implement instability detection procedures, which allows the approximate formula to stand alone and resultantly enables us to draw phase diagrams for any specific dataset. Third, we propose an annealing procedure, called nonconvexity annealing, to obtain the solution path efficiently. Numerical simulations are conducted on simulated datasets to examine these results to verify the consistency of the theoretical results and the efficiency of the approximate formula and nonconvexity annealing.

### Ridge Regression: Structure, Cross-Validation, and Sketching

We study the following three fundamental problems about ridge regression: (1) what is the structure of the estimator? (2) how to correctly use cross-validation to choose the regularization parameter? and (3) how to accelerate computation without losing too much accuracy? We consider the three problems in a unified large-data linear model. We give a precise representation of ridge regression as a covariance matrix-dependent linear combination of the true parameter and the noise. We study the bias of $K$-fold cross-validation for choosing the regularization parameter, and propose a simple bias-correction. We analyze the accuracy of primal and dual sketching for ridge regression, showing they are surprisingly accurate. Our results are illustrated by simulations and by analyzing empirical data.

### A review of machine learning applications in wildfire science and management

Artificial intelligence has been applied in wildfire science and management since the 1990s, with early applications including neural networks and expert systems. Since then the field has rapidly progressed congruently with the wide adoption of machine learning (ML) in the environmental sciences. Here, we present a scoping review of ML in wildfire science and management. Our objective is to improve awareness of ML among wildfire scientists and managers, as well as illustrate the challenging range of problems in wildfire science available to data scientists. We first present an overview of popular ML approaches used in wildfire science to date, and then review their use in wildfire science within six problem domains: 1) fuels characterization, fire detection, and mapping; 2) fire weather and climate change; 3) fire occurrence, susceptibility, and risk; 4) fire behavior prediction; 5) fire effects; and 6) fire management. We also discuss the advantages and limitations of various ML approaches and identify opportunities for future advances in wildfire science and management within a data science context. We identified 298 relevant publications, where the most frequently used ML methods included random forests, MaxEnt, artificial neural networks, decision trees, support vector machines, and genetic algorithms. There exists opportunities to apply more current ML methods (e.g., deep learning and agent based learning) in wildfire science. However, despite the ability of ML models to learn on their own, expertise in wildfire science is necessary to ensure realistic modelling of fire processes across multiple scales, while the complexity of some ML methods requires sophisticated knowledge for their application. Finally, we stress that the wildfire research and management community plays an active role in providing relevant, high quality data for use by practitioners of ML methods.

### Intelligence, physics and information -- the tradeoff between accuracy and simplicity in machine learning

How can we enable machines to make sense of the world, and become better at learning? To approach this goal, I believe viewing intelligence in terms of many integral aspects, and also a universal two-term tradeoff between task performance and complexity, provides two feasible perspectives. In this thesis, I address several key questions in some aspects of intelligence, and study the phase transitions in the two-term tradeoff, using strategies and tools from physics and information. Firstly, how can we make the learning models more flexible and efficient, so that agents can learn quickly with fewer examples? Inspired by how physicists model the world, we introduce a paradigm and an AI Physicist agent for simultaneously learning many small specialized models (theories) and the domain they are accurate, which can then be simplified, unified and stored, facilitating few-shot learning in a continual way. Secondly, for representation learning, when can we learn a good representation, and how does learning depend on the structure of the dataset? We approach this question by studying phase transitions when tuning the tradeoff hyperparameter. In the information bottleneck, we theoretically show that these phase transitions are predictable and reveal structure in the relationships between the data, the model, the learned representation and the loss landscape. Thirdly, how can agents discover causality from observations? We address part of this question by introducing an algorithm that combines prediction and minimizing information from the input, for exploratory causal discovery from observational time series. Fourthly, to make models more robust to label noise, we introduce Rank Pruning, a robust algorithm for classification with noisy labels. I believe that building on the work of my thesis we will be one step closer to enable more intelligent machines that can make sense of the world.