
Water Resource Research


Knowledge distillation as a pathway toward next-generation intelligent ecohydrological modeling systems

Jiang, Long, Yang, Yang, Chui, Ting Fong May, Thornwell, Morgan, Gupta, Hoshin Vijai

arXiv.org Artificial Intelligence

Simulating ecohydrological processes is essential for understanding complex environmental systems and guiding sustainable management amid accelerating climate change and human pressures. Process-based models provide physical realism but can suffer from structural rigidity, high computational costs, and complex calibration, while machine learning (ML) methods are efficient and flexible yet often lack interpretability and transferability. We propose a unified three-phase framework that integrates process-based models with ML and progressively embeds them into artificial intelligence (AI) through knowledge distillation. Phase I, behavioral distillation, enhances process models via surrogate learning and model simplification to capture key dynamics at lower computational cost. Phase II, structural distillation, reformulates process equations as modular components within a graph neural network (GNN), enabling multiscale representation and seamless integration with ML models. Phase III, cognitive distillation, embeds expert reasoning and adaptive decision-making into intelligent modeling agents using the Eyes-Brain-Hands-Mouth architecture. Demonstrations for the Samish watershed highlight the framework's applicability to ecohydrological modeling, showing that it can reproduce process-based model outputs, improve predictive accuracy, and support scenario-based decision-making. The framework offers a scalable and transferable pathway toward next-generation intelligent ecohydrological modeling systems, with potential for extension to other process-based domains.
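Phase I's behavioral distillation, i.e., surrogate learning, amounts to fitting a fast statistical emulator to input-output pairs generated by the expensive process model. A minimal sketch of that idea follows; both the toy "process model" and the polynomial-ridge surrogate are hypothetical stand-ins, not components of the paper's framework:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an expensive process-based model: a nonlinear
# rainfall/evapotranspiration response (hypothetical, for illustration).
def process_model(precip, pet):
    return np.maximum(precip - 0.5 * pet, 0.0) ** 1.2

# Run the "expensive" model to build a training set for the surrogate.
P = rng.uniform(0, 50, 2000)
E = rng.uniform(0, 10, 2000)
y = process_model(P, E)

# Cheap surrogate: ridge regression on simple polynomial features.
X = np.column_stack([P, E, P * E, P**2, E**2, np.ones_like(P)])
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ y)

# The fitted surrogate reproduces the process model at negligible cost.
y_hat = X @ w
rmse = float(np.sqrt(np.mean((y_hat - y) ** 2)))
print(f"surrogate RMSE: {rmse:.2f}")
```

Once trained, the surrogate replaces the process model wherever many cheap evaluations are needed (calibration, scenario screening), which is what makes the later phases tractable.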


Benchmark Dataset for Pore-Scale CO2-Water Interaction

Abdellatif, Alhasan, Menke, Hannah P., Maes, Julien, Elsheikh, Ahmed H., Doster, Florian

arXiv.org Artificial Intelligence

Accurately capturing the complex interaction between CO2 and water in porous media at the pore scale is essential for various geoscience applications, including carbon capture and storage (CCS). We introduce a comprehensive dataset generated from high-fidelity numerical simulations to capture the intricate interaction between CO2 and water at the pore scale. The dataset consists of 624 2D samples, each of size 512×512 with a resolution of 35 µm, covering 100 time steps under a constant CO2 injection rate. It includes various levels of heterogeneity, represented by different grain sizes with random variation in spacing, offering a robust testbed for developing predictive models. This dataset provides high-resolution temporal and spatial information crucial for benchmarking machine learning models.


A Deep-Learning Iterative Stacked Approach for Prediction of Reactive Dissolution in Porous Media

Cirne, Marcos, Menke, Hannah, Abdellatif, Alhasan, Maes, Julien, Doster, Florian, Elsheikh, Ahmed H.

arXiv.org Artificial Intelligence

Simulating reactive dissolution of solid minerals in porous media has many subsurface applications, including carbon capture and storage (CCS), geothermal systems, and oil & gas recovery. As traditional direct numerical simulators are computationally expensive, it is of paramount importance to develop faster and more efficient alternatives. Deep-learning-based solutions, most of them built upon convolutional neural networks (CNNs), have recently been designed to tackle this problem. However, these solutions were limited to approximating one field over the domain (e.g. the velocity field). In this manuscript, we present a novel deep learning approach that incorporates both temporal and spatial information to predict the future states of the dissolution process at a fixed time-step horizon, given a sequence of input states. The overall performance, in terms of speed and prediction accuracy, is demonstrated on a numerical simulation dataset, with predictions compared against state-of-the-art approaches; the method also achieves a speedup of around 10^4 over traditional numerical simulators.
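The iterative stacked idea, predicting the next state from a window of past states and feeding each prediction back into the window, can be sketched with a stand-in one-step model. The damped-smoothing dynamics below are purely illustrative; the paper uses a CNN-based network trained on reactive-dissolution simulations:

```python
import numpy as np

# Toy stand-in for the trained network: a damped smoothing step on a
# 2D field (hypothetical dynamics, chosen only to make the rollout run).
def step_model(history):
    # history: (k, H, W) window of past states; predict the next state.
    past = history.mean(axis=0)
    smoothed = 0.25 * (np.roll(past, 1, 0) + np.roll(past, -1, 0)
                       + np.roll(past, 1, 1) + np.roll(past, -1, 1))
    return 0.9 * smoothed  # slow decay mimics a dissipative process

def rollout(init_history, n_steps, window=3):
    # Iterative stacking: each prediction re-enters the input window,
    # so long horizons are reached by composing one-step predictions.
    hist = list(init_history)
    preds = []
    for _ in range(n_steps):
        nxt = step_model(np.stack(hist[-window:]))
        preds.append(nxt)
        hist.append(nxt)
    return np.stack(preds)

rng = np.random.default_rng(1)
init = [rng.random((16, 16)) for _ in range(3)]
future = rollout(init, n_steps=10)
print(future.shape)
```

The design trade-off in such rollouts is that one-step errors compound over the horizon, which is why evaluation against full simulator trajectories (as done in the paper) matters.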


Using Machine Learning to Discover Parsimonious and Physically-Interpretable Representations of Catchment-Scale Rainfall-Runoff Dynamics

Wang, Yuan-Heng, Gupta, Hoshin V.

arXiv.org Artificial Intelligence

Despite the excellent real-world predictive performance of modern machine learning (ML) methods, many scientists remain hesitant to discard traditional physical-conceptual (PC) approaches due mainly to their relative interpretability, which contributes to credibility during decision-making. In this context, a currently underexplored aspect of ML is how to develop "minimally-optimal" representations that can facilitate better "insight regarding system functioning". Regardless of how this is achieved, it is arguably true that parsimonious representations better support the advancement of scientific understanding. Our own view is that ML-based modeling of geoscientific systems should be based on the use of computational units that are fundamentally interpretable by design. This paper continues our exploration of how the strengths of ML can be exploited in the service of better understanding via scientific investigation. Here, we use the Mass Conserving Perceptron (MCP) as the fundamental computational unit in a generic network architecture consisting of nodes arranged in series and parallel to explore several generic and important issues related to the use of observational data for constructing input-state-output models of dynamical systems. In the context of lumped catchment modeling, we show that physical interpretability and excellent predictive performance can both be achieved using a relatively parsimonious "distributed-state" multiple-flowpath network with context-dependent gating and "information sharing" across the nodes, suggesting that MCP-based modeling can play a significant role in application of ML to geoscientific investigation.
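The core constraint behind the Mass Conserving Perceptron, that gated partitioning of fluxes closes the water balance by construction, can be illustrated with a single toy node. The fixed gate values below are illustrative placeholders; in the MCP the gates are learned and context-dependent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mcp_node(storage, inflow, a=0.3, g=0.5):
    """One step of a toy mass-conserving unit.

    A gate decides what fraction of available water is released, and the
    release is split between two flowpaths; storage absorbs the rest, so
    mass is conserved by construction. (Illustrative only; the paper's
    MCP uses learned gating functions.)
    """
    available = storage + inflow
    release = sigmoid(a) * available   # gated total outflow
    fast = g * release                 # flowpath 1
    slow = (1.0 - g) * release         # flowpath 2
    new_storage = available - release  # remainder stays in the cell-state
    return new_storage, fast, slow

S, total_out = 10.0, 0.0
inflows = [5.0, 0.0, 2.0, 8.0, 0.0]
for u in inflows:
    S, q1, q2 = mcp_node(S, u)
    total_out += q1 + q2

# Mass balance: initial storage + total inflow = final storage + total outflow.
closure = (10.0 + sum(inflows)) - (S + total_out)
print(f"mass-balance residual: {closure:.2e}")
```

Because closure holds for any gate values, training can focus entirely on fitting the dynamics, which is the interpretability-by-design point the abstract makes.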


GeoFUSE: A High-Efficiency Surrogate Model for Seawater Intrusion Prediction and Uncertainty Reduction

Jiang, Su, Liu, Chuyang, Dwivedi, Dipankar

arXiv.org Artificial Intelligence

Seawater intrusion into coastal aquifers poses a significant threat to groundwater resources, especially with rising sea levels due to climate change. Accurate modeling and uncertainty quantification of this process are crucial but are often hindered by the high computational costs of traditional numerical simulations. In this work, we develop GeoFUSE, a novel deep-learning-based surrogate framework that integrates the U-Net Fourier Neural Operator (U-FNO) with Principal Component Analysis (PCA) and Ensemble Smoother with Multiple Data Assimilation (ESMDA). GeoFUSE enables fast and efficient simulation of seawater intrusion while significantly reducing uncertainty in model predictions. We apply GeoFUSE to a 2D cross-section of the Beaver Creek tidal stream-floodplain system in Washington State. Using 1,500 geological realizations, we train the U-FNO surrogate model to approximate salinity distribution and accumulation. The U-FNO model successfully reduces the computational time from hours (using PFLOTRAN simulations) to seconds, achieving a speedup of approximately 360,000 times while maintaining high accuracy. By integrating measurement data from monitoring wells, the framework significantly reduces geological uncertainty and improves the predictive accuracy of the salinity distribution over a 20-year period. Our results demonstrate that GeoFUSE improves computational efficiency and provides a robust tool for real-time uncertainty quantification and decision making in groundwater management. Future work will extend GeoFUSE to 3D models and incorporate additional factors such as sea-level rise and extreme weather events, making it applicable to a broader range of coastal and subsurface flow systems.
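The data-assimilation step at the core of GeoFUSE, the Ensemble Smoother with Multiple Data Assimilation (ESMDA), can be sketched on a toy linear problem. The forward operator, noise level, and ensemble size below are illustrative placeholders, not the paper's PFLOTRAN/U-FNO setup:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear forward model standing in for the surrogate simulator.
G = rng.normal(size=(5, 3))
m_true = np.array([1.0, -2.0, 0.5])
obs_err = 0.05
d_obs = G @ m_true + rng.normal(scale=obs_err, size=5)

# Prior ensemble of parameter realizations (e.g., PCA coefficients).
Ne = 200
M = rng.normal(size=(3, Ne))

alphas = [4.0, 4.0, 4.0, 4.0]  # ESMDA inflation factors; sum(1/a) = 1
for a in alphas:
    D = G @ M                  # forward runs (cheap with a surrogate)
    # Perturb observations with inflated noise, one draw per member.
    d_pert = d_obs[:, None] + np.sqrt(a) * rng.normal(scale=obs_err,
                                                      size=(5, Ne))
    Mc = M - M.mean(axis=1, keepdims=True)
    Dc = D - D.mean(axis=1, keepdims=True)
    C_md = Mc @ Dc.T / (Ne - 1)    # parameter-data cross-covariance
    C_dd = Dc @ Dc.T / (Ne - 1)    # data covariance
    K = C_md @ np.linalg.inv(C_dd + a * obs_err**2 * np.eye(5))
    M = M + K @ (d_pert - D)       # Kalman-style ensemble update

err = float(np.linalg.norm(M.mean(axis=1) - m_true))
print(f"posterior-mean error: {err:.3f}")
```

The reason a fast surrogate matters is visible in the loop: every assimilation step requires a forward run per ensemble member, which is prohibitive with the full simulator.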


Evaluating Deep Learning Approaches for Predictions in Unmonitored Basins with Continental-scale Stream Temperature Models

Willard, Jared D., Ciulla, Fabio, Weierbach, Helen, Kumar, Vipin, Varadharajan, Charuleka

arXiv.org Artificial Intelligence

The prediction of streamflows and other environmental variables in unmonitored basins is a grand challenge in hydrology. Recent machine learning (ML) models can harness vast datasets for accurate predictions at large spatial scales. However, there are open questions regarding model design and the data needed for inputs and training to improve performance. This study explores these questions while demonstrating the ability of deep learning models to make accurate stream temperature predictions in unmonitored basins across the conterminous United States. First, we compare top-down models that utilize data from a large number of basins with bottom-up methods that transfer ML models built on local sites, reflecting traditional regionalization techniques. We also evaluate an intermediary grouped modeling approach that categorizes sites based on regional co-location or similarity of catchment characteristics. Second, we evaluate trade-offs between model complexity, prediction accuracy, and applicability to more target locations by systematically removing inputs. We then examine model performance when additional training data becomes available due to reductions in input requirements. Our results suggest that top-down models significantly outperform bottom-up and grouped models. Moreover, it is possible to obtain acceptable accuracy by reducing both dynamic and static inputs, enabling predictions for more sites with lower model complexity and computational needs. From detailed error analysis, we determined that the models are more accurate for sites primarily controlled by air temperatures compared to locations impacted by groundwater and dams. By addressing these questions, this research offers a comprehensive perspective on optimizing ML model design for accurate predictions in unmonitored regions.


Methods to improve run time of hydrologic models: opportunities and challenges in the machine learning era

Dhital, Supath

arXiv.org Artificial Intelligence

The application of Machine Learning (ML) to hydrologic modeling is still in its early stages. Its ability to capture dependencies within watersheds and produce better forecasts within a short period is compelling. One of the key reasons to adopt ML algorithms over physics-based models is their computational efficiency and flexibility in working with various data sets. Diverse applications, particularly in emergency response and modeling over large scales, demand hydrological results within a short time frame and lead researchers to adopt data-driven modeling approaches. This work examines how, in the era of ML and deep learning (DL), these methods can help improve the overall run time of physics-based models, and the potential constraints that should be addressed during modeling. The paper covers the opportunities and challenges of adopting ML for hydrological modeling, how ML can help reduce the simulation time of physics-based models, and future work that should be addressed.


Towards Interpretable Physical-Conceptual Catchment-Scale Hydrological Modeling using the Mass-Conserving-Perceptron

Wang, Yuan-Heng, Gupta, Hoshin V.

arXiv.org Artificial Intelligence

We investigate the applicability of machine learning technologies to the development of parsimonious, interpretable, catchment-scale hydrologic models using directed-graph architectures based on the mass-conserving perceptron (MCP) as the fundamental computational unit. Here, we focus on architectural complexity (depth) at a single location, rather than universal applicability (breadth) across large samples of catchments. The goal is to discover a minimal representation (numbers of cell-states and flow paths) that represents the dominant processes that can explain the input-state-output behaviors of a given catchment, with particular emphasis given to simulating the full range (high, medium, and low) of flow dynamics. We find that a HyMod-like architecture with three cell-states and two major flow pathways achieves such a representation at our study location, but that the additional incorporation of an input-bypass mechanism significantly improves the timing and shape of the hydrograph, while the inclusion of bi-directional groundwater mass exchanges significantly enhances the simulation of baseflow. Overall, our results demonstrate the importance of using multiple diagnostic metrics for model evaluation, while highlighting the need for designing training metrics that are better suited to extracting information across the full range of flow dynamics. Further, they set the stage for interpretable regional-scale MCP-based hydrological modeling (using large sample data) by using neural architecture search to determine appropriate minimal representations for catchments in different hydroclimatic regimes.


Randomized Physics-Informed Machine Learning for Uncertainty Quantification in High-Dimensional Inverse Problems

Zong, Yifei, Barajas-Solano, David, Tartakovsky, Alexandre M.

arXiv.org Artificial Intelligence

We propose a physics-informed machine learning method for uncertainty quantification in high-dimensional inverse problems. In this method, the states and parameters of partial differential equations (PDEs) are approximated with truncated conditional Karhunen-Loève expansions (CKLEs), which, by construction, match the measurements of the respective variables. The maximum a posteriori (MAP) solution of the inverse problem is formulated as a minimization problem over CKLE coefficients where the loss function is the sum of the norm of PDE residuals and the ℓ2 regularization term. This MAP formulation is known as the physics-informed CKLE (PICKLE) method. Uncertainty in the inverse solution is quantified in terms of the posterior distribution of CKLE coefficients, and we sample the posterior by solving a randomized PICKLE minimization problem, formulated by adding zero-mean Gaussian perturbations to the PICKLE loss function. We call the proposed approach the randomized PICKLE (rPICKLE) method. For linear and low-dimensional nonlinear problems (15 CKLE parameters), we show analytically and through comparison with Hamiltonian Monte Carlo (HMC) that the rPICKLE posterior converges to the true posterior given by the Bayes rule. For high-dimensional nonlinear problems with 2000 CKLE parameters, we numerically demonstrate that rPICKLE posteriors are highly informative: they provide mean estimates with an accuracy comparable to the estimates given by the MAP solution, and a confidence interval that mostly covers the reference solution. We are not able to obtain the HMC posterior to validate rPICKLE's convergence to the true posterior due to HMC's prohibitive computational cost for the considered high-dimensional problems. Our results demonstrate the advantages of rPICKLE over HMC for approximately sampling high-dimensional posterior distributions subject to physics constraints.
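The randomization idea, drawing approximate posterior samples by repeatedly minimizing a loss with zero-mean Gaussian perturbations added to both the data-misfit and regularization terms, has a closed-form solve in the linear-Gaussian case. The toy regression below is an illustrative analogue only; rPICKLE perturbs a PDE-residual loss over CKLE coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy linear inverse problem: d = G m + noise, Gaussian prior on m.
G = rng.normal(size=(20, 4))
m_true = rng.normal(size=4)
sigma, tau = 0.1, 1.0          # observation-noise std, prior std
d = G @ m_true + rng.normal(scale=sigma, size=20)

def randomized_map():
    """One posterior sample via a randomized MAP (rPICKLE-style) solve.

    Zero-mean Gaussian perturbations are added to the data and to the
    regularization anchor, and the perturbed quadratic loss is minimized
    in closed form. For this linear-Gaussian toy the samples follow the
    exact Bayesian posterior.
    """
    d_p = d + rng.normal(scale=sigma, size=20)       # perturbed data misfit
    m_p = rng.normal(scale=tau, size=4)              # perturbed regularizer
    A = G.T @ G / sigma**2 + np.eye(4) / tau**2
    b = G.T @ d_p / sigma**2 + m_p / tau**2
    return np.linalg.solve(A, b)

samples = np.stack([randomized_map() for _ in range(500)])
print("posterior mean:", np.round(samples.mean(axis=0), 2))
```

Each sample costs one minimization, which is why the abstract compares rPICKLE's cost against HMC rather than against a single MAP solve.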


Learning to Generate Lumped Hydrological Models

Yang, Yang, Chui, Ting Fong May

arXiv.org Artificial Intelligence

A lumped hydrological model structure can be considered a generative model because, given a set of parameter values, it can generate a hydrological modeling function that accurately predicts the behavior of a catchment under external forcing. It is implicitly assumed that a small number of variables (i.e., the model parameters) can sufficiently characterize variations in the behavioral characteristics of different catchments. This study adopts this assumption and uses a deep learning method to learn a generative model of hydrological modeling functions directly from the forcing and runoff data of multiple catchments. The learned generative model uses a small number of latent variables to characterize a catchment's behavior, so that assigning values to these latent variables produces a hydrological modeling function that resembles a real-world catchment. The learned generative model can be used similarly to a lumped model structure, i.e., the optimal hydrological modeling function of a catchment can be derived by estimating optimal parameter values (or latent variables) with a generic calibration algorithm. In this study, a generative model was learned from data from over 3,000 catchments worldwide. The model was then used to derive optimal modeling functions for over 700 different catchments. The resulting modeling functions generally showed a quality that was comparable to or better than 36 types of lumped model structures. Overall, this study demonstrates that the hydrological behavior of a catchment can be effectively described using a small number of latent variables, and that well-fitting hydrological modeling functions can be reconstructed from these variables.
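The calibration workflow described here, fixing a learned decoder and estimating only its latent variables against observed runoff, can be sketched with a stand-in decoder. The leaky-bucket decoder and random-search calibrator below are hypothetical simplifications of the paper's deep generative model and generic calibration algorithm:

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in "generative model": decodes 2 latent variables into a simple
# runoff function (a leaky bucket). In the paper the decoder is a deep
# network learned from thousands of catchments.
def decode_and_run(z, precip):
    k = 1.0 / (1.0 + np.exp(-z[0]))   # recession coefficient in (0, 1)
    c = 1.0 / (1.0 + np.exp(-z[1]))   # runoff ratio in (0, 1)
    S, q = 0.0, []
    for p in precip:
        S += c * p
        out = k * S                   # storage releases a fixed fraction
        S -= out
        q.append(out)
    return np.array(q)

# Synthetic "observed" catchment generated by the same decoder.
precip = rng.gamma(2.0, 2.0, size=200)
z_true = np.array([0.4, -0.8])
q_obs = decode_and_run(z_true, precip)

# Generic calibration: random search over the latent space.
best_z, best_err = None, np.inf
for _ in range(2000):
    z = rng.uniform(-3, 3, size=2)
    err = float(np.mean((decode_and_run(z, precip) - q_obs) ** 2))
    if err < best_err:
        best_z, best_err = z, err

print(f"calibrated latents: {np.round(best_z, 2)}, MSE: {best_err:.4f}")
```

Any generic optimizer could replace the random search; the point of the sketch is that once the decoder is fixed, calibrating a catchment reduces to a low-dimensional search over latent variables, just as with a conventional lumped model's parameters.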