Many practical combustion systems, such as those in rockets, gas turbines, and internal combustion engines, operate under high pressures that surpass the thermodynamic critical limit of fuel-oxidizer mixtures. These conditions require the consideration of complex fluid behaviors that pose challenges for numerical simulations, casting doubt on the validity of existing subgrid-scale (SGS) models in large-eddy simulations of these systems. While data-driven methods have shown high accuracy as closure models in simulations of turbulent flames, these models are often criticized for a lack of physical interpretability: they provide answers but no insight into their underlying rationale. The objective of this study is to assess SGS stress models from conventional physics-driven approaches and from an interpretable machine learning algorithm, i.e., the random forest regressor, in a turbulent transcritical non-premixed flame. To this end, direct numerical simulations (DNS) of transcritical liquid-oxygen/gaseous-methane (LOX/GCH4) inert and reacting flows are performed. An a priori analysis is then performed on the Favre-filtered DNS data to examine the accuracy of physics-based and random forest SGS models under these conditions. SGS stresses calculated with the gradient model show good agreement with the exact terms extracted from filtered DNS. The accuracy of the random forest regressor decreases when physics-based constraints are applied to the feature set. Results demonstrate that random forests can perform as effectively as algebraic models when modeling subgrid stresses, but only when trained on a sufficiently representative database. The random forest feature importance score is shown to provide insight into discovering subgrid-scale stresses through sparse regression.
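For reference, the gradient model mentioned above is commonly written in Favre-filtered form as follows. This is the textbook expression, not necessarily the exact variant used in the study; the filter width \Delta and the coefficient 1/12 (appropriate for box or Gaussian filters) are assumptions here.

```latex
\tau_{ij}^{\mathrm{sgs}}
  = \bar{\rho}\left(\widetilde{u_i u_j} - \tilde{u}_i \tilde{u}_j\right)
  \approx \bar{\rho}\,\frac{\Delta^2}{12}\,
    \frac{\partial \tilde{u}_i}{\partial x_k}\,
    \frac{\partial \tilde{u}_j}{\partial x_k}
```

Here an overbar denotes the filter, a tilde denotes the Favre (density-weighted) filter, and summation over the repeated index k is implied. In an a priori analysis, the left-hand side is evaluated exactly from filtered DNS and compared with the modeled right-hand side.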
This graduate textbook on machine learning tells a story of how patterns in data support predictions and consequential actions. Starting with the foundations of decision making, we cover representation, optimization, and generalization as the constituents of supervised learning. A chapter on datasets as benchmarks examines their histories and scientific bases. Self-contained introductions to causality, the practice of causal inference, sequential decision making, and reinforcement learning equip the reader with concepts and tools to reason about actions and their consequences. Throughout, the text discusses historical context and societal impact. We invite readers from all backgrounds; some experience with probability, calculus, and linear algebra suffices.
The current work is motivated by the need for robust statistical methods for precision medicine; as such, we address the need for statistical methods that provide actionable inference for a single unit at any point in time. We aim to learn an optimal, unknown choice of the controlled components of the design in order to optimize the expected outcome; with that, we adapt the randomization mechanism for future time-point experiments based on the data collected on the individual over time. Our results demonstrate that one can learn the optimal rule based on a single sample, and thereby adjust the design at any point t with valid inference for the mean target parameter. This work provides several contributions to the field of statistical precision medicine. First, we define a general class of averages of conditional causal parameters defined by the current context for the single-unit time-series data. We define a nonparametric model for the probability distribution of the time series under few assumptions, and aim to fully utilize the sequential randomization in the estimation procedure via the doubly robust structure of the efficient influence curve of the proposed target parameter. We present multiple exploration-exploitation strategies for assigning treatment, and methods for estimating the optimal rule. Lastly, we study data-adaptive inference on the mean under the optimal treatment rule, where the target parameter adapts over time in response to the observed context of the individual. Our target parameter is pathwise differentiable with an efficient influence function that is doubly robust, which makes it easier to estimate than previously proposed variations. We characterize the limit distribution of our estimator under a Donsker condition expressed in terms of a notion of bracketing entropy adapted to martingale settings.
Feature attribution is widely used in interpretable machine learning to explain how influential each measured input feature value is for an output inference. However, measurements can be uncertain, and it is unclear how awareness of input uncertainty affects trust in explanations. We propose and study two approaches to help users manage their perception of uncertainty in a model explanation: 1) transparently showing uncertainty in feature attributions so that users can reflect on it, and 2) suppressing attribution to features with uncertain measurements and shifting attribution to other features by regularizing with an uncertainty penalty. Through simulation experiments, qualitative interviews, and quantitative user evaluations, we identified the benefits of moderately suppressing attribution uncertainty, and concerns regarding showing attribution uncertainty. This work adds to the understanding of handling and communicating uncertainty for model interpretability.
Random-walk based network embedding algorithms like node2vec and DeepWalk are widely used to obtain Euclidean representations of the nodes in a network prior to performing downstream network inference tasks. Nevertheless, despite their impressive empirical performance, there is a lack of theoretical results explaining their behavior. In this paper we study the node2vec and DeepWalk algorithms through the perspective of matrix factorization. We analyze these algorithms in the setting of community detection for stochastic blockmodel graphs; in particular, we establish large-sample error bounds and prove consistent community recovery of node2vec/DeepWalk embedding followed by k-means clustering. Our theoretical results indicate a subtle interplay between the sparsity of the observed networks, the window sizes of the random walks, and the convergence rates of the node2vec/DeepWalk embedding toward the embedding of the true but unknown edge probability matrix. More specifically, as the network becomes sparser, our results suggest using larger window sizes, or equivalently, taking longer random walks, in order to attain a better convergence rate for the resulting embeddings. The paper includes numerical experiments corroborating these observations.
We analyze the reservoir computation capability of the Lang-Kobayashi system by comparing the numerically computed recall capabilities and the eigenvalue spectrum. We show that these two quantities are deeply connected, and thus the reservoir computing performance is predictable by analyzing the eigenvalue spectrum. Our results suggest that any dynamical system used as a reservoir can be analyzed in this way as long as the reservoir perturbations are sufficiently small. Optimal performance is found for a system with the eigenvalues having real parts close to zero and off-resonant imaginary parts.
Improving and optimizing oceanographic sampling is a crucial task for marine science and maritime resource management. Faced with limited resources for understanding processes in the water column, the combination of statistics and autonomous systems provides new opportunities for experimental design. In this work we develop efficient spatial sampling methods for characterizing regions defined by simultaneous exceedances above prescribed thresholds of several responses, with an application focus on mapping coastal ocean phenomena based on temperature and salinity measurements. Specifically, we define a design criterion based on uncertainty in the excursions of vector-valued Gaussian random fields, and derive tractable expressions for the expected integrated Bernoulli variance reduction in such a framework. We demonstrate how this criterion can be used to prioritize sampling efforts at locations that are ambiguous, making exploration more effective. We use simulations to study and compare properties of the considered approaches, followed by results from field deployments with an autonomous underwater vehicle as part of a study mapping the boundary of a river plume. The results demonstrate the potential of combining statistical methods and robotic platforms to effectively inform and execute data-driven environmental sampling.
Estimation of the soil organic carbon (SOC) content is of utmost importance in understanding the chemical, physical, and biological functions of the soil. This study proposes the machine learning algorithms of support vector machines, artificial neural networks, regression trees, random forests, extreme gradient boosting, and a conventional deep neural network (DNN) for advancing prediction models of SOC. Models are trained with 1879 composite surface soil samples and 105 auxiliary variables as predictors. A genetic algorithm is used as a feature selection approach to identify effective variables. The results indicate that precipitation is the most important predictor, driving 15 percent of SOC spatial variability, followed by the normalized difference vegetation index, the day temperature index of the Moderate Resolution Imaging Spectroradiometer, multiresolution valley bottom flatness, and land use, respectively. Based on 10-fold cross-validation, the DNN model was found to be the superior algorithm, with the lowest prediction error and uncertainty. In terms of accuracy, the DNN yielded a mean absolute error of 59 percent, a root mean squared error of 75 percent, a coefficient of determination of 0.65, and a Lin's concordance correlation coefficient of 0.83. The SOC content was highest in the udic soil moisture regime class, with mean values of 4 percent, followed by the aquic and xeric classes, respectively. Soils in dense forestlands had the highest SOC contents, whereas soils of younger geological age and alluvial fans had lower SOC. The proposed DNN is a promising algorithm for handling large numbers of auxiliary data at a province scale; owing to its flexible structure and its ability to extract more information from the auxiliary data surrounding the sampled observations, it predicted the SOC baseline map with high accuracy and minimal uncertainty.
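The four accuracy measures named above can be computed from paired observed and predicted values as in the following minimal sketch. The function name is hypothetical, and the definitions are the standard ones (coefficient of determination as one minus the ratio of residual to total sum of squares, Lin's coefficient with 1/n variances); the study may have used variants, for example errors normalized to percent.

```python
import math


def regression_metrics(observed, predicted):
    """Return MAE, RMSE, R^2 and Lin's concordance correlation coefficient.

    Assumes the observed values are not all identical; otherwise the
    total sum of squares is zero and R^2 is undefined.
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    errors = [p - o for o, p in zip(observed, predicted)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Population (1/n) variances and covariance, as in Lin's definition.
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum(
        (o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)
    ) / n

    # R^2 = 1 - SS_res / SS_tot.
    r2 = 1.0 - sum(e * e for e in errors) / (n * var_o)
    # Lin's CCC penalizes both scatter and systematic bias (mean shift).
    ccc = 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)

    return {"mae": mae, "rmse": rmse, "r2": r2, "ccc": ccc}
```

Unlike the Pearson correlation, Lin's coefficient drops below one when predictions are perfectly correlated with observations but offset by a constant bias, which is why it is reported alongside the coefficient of determination.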
Failure in brittle materials, driven by the evolution of micro- to macro-cracks under repetitive or increasing loads, is often catastrophic, with no significant plasticity to warn of the onset of fracture. Early detection of failure and of its location is critically important in any practical application, and both can be effectively addressed using artificial intelligence. In this paper, we develop a supervised machine learning (ML) framework to predict failure in an isothermal, linear elastic, and isotropic phase-field model for damage and fatigue of brittle materials. Time-series data from the phase-field model are extracted at virtual sensing nodes at different locations of the geometry. A pattern recognition scheme is introduced to represent the time-series data (sensor node responses) as patterns with corresponding labels, which are then used by ML algorithms for damage classification. We perform an uncertainty analysis by superposing random noise on the time-series data to assess the robustness of the framework with noise-polluted data. Results indicate that the proposed framework is capable of predicting failure with acceptable accuracy even in the presence of high noise levels. The findings demonstrate satisfactory performance of the supervised ML framework, and the applicability of artificial intelligence and ML to a practical engineering problem, i.e., data-driven failure prediction in brittle materials.
Change point detection is an important part of time series analysis, as the presence of a change point indicates an abrupt and significant change in the data generating process. While many algorithms for change point detection exist, little attention has been paid to evaluating their performance on real-world time series. Algorithms are typically evaluated on simulated data and a small number of commonly-used series with unreliable ground truth. Clearly this does not provide sufficient insight into the comparative performance of these algorithms. Therefore, instead of developing yet another change point detection method, we consider it vastly more important to properly evaluate existing algorithms on real-world data. To achieve this, we present the first data set specifically designed for the evaluation of change point detection algorithms, consisting of 37 time series from various domains. Each time series was annotated by five expert human annotators to provide ground truth on the presence and location of change points. We analyze the consistency of the human annotators, and describe evaluation metrics that can be used to measure algorithm performance in the presence of multiple ground truth annotations. Subsequently, we present a benchmark study where 13 existing algorithms are evaluated on each of the time series in the data set. This study shows that binary segmentation (Scott and Knott, 1974) and Bayesian online change point detection (Adams and MacKay, 2007) are among the best performing methods. Our aim is that this data set will serve as a proving ground in the development of novel change point detection algorithms.
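As an illustration of the binary segmentation idea, the following minimal sketch detects shifts in the mean of a series by recursively splitting wherever the reduction in squared error exceeds a penalty. It is a generic least-squares variant written for this example, not the implementation evaluated in the benchmark; the function names, the `penalty` parameter, and the `min_size` parameter are assumptions.

```python
def sse(values):
    """Sum of squared deviations from the mean (cost of one segment)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(series, start, end, min_size):
    """Return (gain, index) of the best single split of series[start:end].

    The gain is the drop in total squared error obtained by fitting two
    means instead of one. The index is None if no split improves the fit.
    """
    base = sse(series[start:end])
    best_gain, best_idx = 0.0, None
    for k in range(start + min_size, end - min_size + 1):
        gain = base - sse(series[start:k]) - sse(series[k:end])
        if gain > best_gain:
            best_gain, best_idx = gain, k
    return best_gain, best_idx


def binary_segmentation(series, penalty, min_size=2):
    """Return sorted change point indices for a mean-shift model.

    Each returned index is the first position of a new segment. A segment
    is split only if the gain of its best split exceeds `penalty`.
    """
    change_points = []
    stack = [(0, len(series))]
    while stack:
        start, end = stack.pop()
        gain, idx = best_split(series, start, end, min_size)
        if idx is not None and gain > penalty:
            change_points.append(idx)
            stack.append((start, idx))
            stack.append((idx, end))
    return sorted(change_points)
```

For example, calling `binary_segmentation([0.0] * 10 + [5.0] * 10, penalty=1.0)` returns `[10]`. The penalty controls how many change points are reported, which is one reason evaluating such methods against human annotations on real-world series is informative.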