field data


CosmoFlow: Scale-Aware Representation Learning for Cosmology with Flow Matching

Kannan, Sidharth, Qiu, Tian, Cuesta-Lazaro, Carolina, Jeong, Haewon

arXiv.org Artificial Intelligence

The large-scale structure of the Universe provides one of the most stringent tests of gravity on cosmological scales. Over the past decades, the ΛCDM cosmological model has emerged as the standard framework for understanding our cosmos, where Λ represents the cosmological constant (associated with dark energy) and CDM denotes cold dark matter--which together comprise approximately 95% of the Universe's energy budget. Theoretical predictions of ΛCDM can now be implemented with remarkable precision in numerical simulations, which capture the formation of the cosmic web: an intricate network where galaxies reside in dense clusters, connected by filamentary structures and separated by vast cosmic voids. This success, however, presents cosmology with a new challenge. High-resolution simulations like AbacusSummit generate datasets exceeding 2000 TB, severely constraining our ability to scale training datasets for machine learning applications. Moreover, extracting meaningful insights from these high-dimensional datasets requires models that can effectively navigate the curse of dimensionality.
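
Flow matching, the technique named in the title, trains a network to regress the velocity of a simple probability path between noise and data. A minimal, illustrative sketch of that objective follows (not the authors' CosmoFlow code; the toy MLP, shapes, and batch are placeholders):

```python
# Minimal conditional flow-matching objective (illustrative sketch only).
import torch
import torch.nn as nn

velocity_net = nn.Sequential(          # stand-in for the real architecture
    nn.Linear(3 + 1, 128), nn.SiLU(), nn.Linear(128, 3)
)

def flow_matching_loss(x1: torch.Tensor) -> torch.Tensor:
    """x1: batch of data samples; the prior x0 is standard Gaussian."""
    x0 = torch.randn_like(x1)                     # noise endpoint
    t = torch.rand(x1.shape[0], 1)                # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                    # linear interpolation path
    target = x1 - x0                              # velocity of that path
    pred = velocity_net(torch.cat([xt, t], dim=-1))
    return ((pred - target) ** 2).mean()

loss = flow_matching_loss(torch.randn(64, 3))     # dummy batch
loss.backward()
```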


A generative foundation model for an all-in-one seismic processing framework

Cheng, Shijun, Harsuko, Randy, Alkhalifah, Tariq

arXiv.org Artificial Intelligence

Seismic data often face challenges in their utilization due to noise contamination, incomplete acquisition, and limited low-frequency information, which hinder accurate subsurface imaging and interpretation. Traditional processing methods rely heavily on task-specific designs to address these challenges and fail to account for the variability of data. To address these limitations, we present a generative seismic foundation model (GSFM), a unified framework based on generative diffusion models (GDMs), designed to tackle multi-task seismic processing challenges, including denoising, backscattered noise attenuation, interpolation, and low-frequency extrapolation. GSFM leverages a pre-training stage on synthetic data to capture the features of clean, complete, and broadband seismic data distributions and applies an iterative fine-tuning strategy to adapt the model to field data. By adopting a target-oriented diffusion process prediction, GSFM improves computational efficiency without compromising accuracy. Synthetic data tests demonstrate that GSFM surpasses benchmarks with equivalent architectures in all tasks and achieves performance comparable to models trained with traditional pre-training strategies, even after those models are fine-tuned. Field data tests also suggest that our iterative fine-tuning approach addresses the generalization limitations of conventional pre-training and fine-tuning paradigms, delivering significantly enhanced performance across diverse tasks. Furthermore, GSFM's inherent probabilistic nature enables effective uncertainty quantification, offering valuable insights into the reliability of processing results.
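
The denoising objective underlying generative diffusion models (GDMs) fits in a few lines; GSFM's target-oriented variant and its architecture are not reproduced here, and every name below is a placeholder:

```python
# Standard DDPM-style noise-prediction loss (illustrative sketch only).
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)              # assumed noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

eps_net = nn.Sequential(                           # stand-in denoiser
    nn.Linear(256 + 1, 512), nn.SiLU(), nn.Linear(512, 256)
)

def ddpm_loss(x0: torch.Tensor) -> torch.Tensor:
    """x0: batch of clean seismic patches, flattened to vectors here."""
    t = torch.randint(0, T, (x0.shape[0],))
    ab = alpha_bar[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps    # forward (noising) step
    t_in = t.float().unsqueeze(-1) / T             # crude time conditioning
    pred = eps_net(torch.cat([xt, t_in], dim=-1))
    return ((pred - eps) ** 2).mean()

loss = ddpm_loss(torch.randn(8, 256))              # dummy batch
```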


Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings

Saha, Mithun, Xu, Maxwell A., Mao, Wanting, Neupane, Sameer, Rehg, James M., Kumar, Santosh

arXiv.org Artificial Intelligence

Photoplethysmography (PPG)-based foundation models are gaining traction due to the widespread use of PPG in biosignal monitoring and their potential to generalize across diverse health applications. In this paper, we introduce Pulse-PPG, the first open-source PPG foundation model trained exclusively on raw PPG data collected over a 100-day field study with 120 participants. Existing PPG foundation models are either open-source but trained on clinical data or closed-source, limiting their applicability in real-world settings. We evaluate Pulse-PPG across multiple datasets and downstream tasks, comparing its performance against a state-of-the-art foundation model trained on clinical data. Our results demonstrate that Pulse-PPG, trained on uncurated field data, exhibits superior generalization across clinical and mobile health applications in both lab and field settings. This suggests that exposure to real-world variability enables the model to learn fine-grained representations, making it more adaptable across tasks. Furthermore, pre-training on field data surprisingly outperforms pre-training on clinical data in many tasks, reinforcing the importance of training on real-world, diverse datasets. To encourage further advancements in robust foundation models leveraging field data, we plan to release Pulse-PPG, providing researchers with a powerful resource for developing more generalizable PPG-based models.
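
The abstract does not state the pre-training objective, so the following is purely an assumed illustration: a masked-reconstruction step on raw 1-D PPG windows, a common self-supervised recipe. All names and sizes are hypothetical:

```python
# Hypothetical masked-reconstruction pre-training step (not Pulse-PPG's recipe).
import torch
import torch.nn as nn

WIN = 512                                          # samples per PPG window (assumed)
net = nn.Sequential(nn.Linear(WIN, 256), nn.GELU(), nn.Linear(256, WIN))

def masked_recon_loss(ppg: torch.Tensor, mask_frac: float = 0.5) -> torch.Tensor:
    mask = torch.rand_like(ppg) < mask_frac        # randomly hide a fraction of samples
    recon = net(ppg.masked_fill(mask, 0.0))        # reconstruct from the rest
    return ((recon - ppg)[mask] ** 2).mean()       # score only the hidden samples

loss = masked_recon_loss(torch.randn(16, WIN))     # dummy batch of windows
```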


Enhancing Deep Learning based RMT Data Inversion using Gaussian Random Field

Ghosal, Koustav, Singh, Arun, Malakar, Samir, Srivastava, Shalivahan, Gupta, Deepak

arXiv.org Artificial Intelligence

Deep learning (DL) methods have emerged as a powerful tool for the inversion of geophysical data. When applied to field data, these models often struggle without additional fine-tuning of the network. This is because they are built on the assumption that the statistical patterns in the training and test datasets are the same. To address this, we propose a DL-based inversion scheme for Radio Magnetotelluric data where the subsurface resistivity models are generated using Gaussian Random Fields (GRF). The network's generalization ability was tested with an out-of-distribution (OOD) dataset comprising a homogeneous background and various rectangular-shaped anomalous bodies. After end-to-end training with the GRF dataset, the pre-trained network successfully identified anomalies in the OOD dataset. Synthetic experiments confirmed that the GRF dataset enhances generalization compared to a homogeneous background OOD dataset. The network accurately recovered structures in a checkerboard resistivity model and demonstrated robustness to noise, outperforming traditional gradient-based methods. Finally, the developed scheme was tested on field data from a waste site near Roorkee, India. The proposed scheme enhances generalization in a data-driven supervised learning framework, suggesting a promising direction for OOD generalization in DL methods.
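
A GRF can be drawn by filtering white noise in Fourier space so that the field follows a prescribed power spectrum. A minimal sketch, assuming a power-law spectrum and an illustrative mapping to resistivity (not the authors' exact recipe):

```python
# 2-D Gaussian random field via spectral filtering (illustrative sketch).
import numpy as np

def gaussian_random_field(n: int = 128, beta: float = 3.0, seed: int = 0) -> np.ndarray:
    """Return an n x n GRF with isotropic power spectrum P(k) ~ k**-beta."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n, n))
    kx = np.fft.fftfreq(n)[:, None]
    ky = np.fft.fftfreq(n)[None, :]
    k = np.sqrt(kx**2 + ky**2)
    k[0, 0] = 1.0                                  # avoid division by zero at DC
    amp = k ** (-beta / 2.0)                       # amplitude = sqrt of spectrum
    amp[0, 0] = 0.0                                # zero-mean field
    field = np.fft.ifft2(np.fft.fft2(noise) * amp).real
    return (field - field.mean()) / field.std()    # standardize

# Assumed mapping of the standardized field to log10-resistivity in ~[1, 3]:
log_rho = 2.0 + 0.5 * gaussian_random_field()
```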


Physics-guided machine learning predicts the planet-scale performance of solar farms with sparse, heterogeneous, public data

Jahangir, Jabir Bin, Alam, Muhammad Ashraful

arXiv.org Artificial Intelligence

The photovoltaics (PV) technology landscape is evolving rapidly. To predict the potential and scalability of emerging PV technologies, a global understanding of these systems' performance is essential. Traditionally, experimental and computational studies at large national research facilities have focused on PV performance in specific regional climates. However, synthesizing these regional studies to understand the worldwide performance potential has proven difficult. Given the expense of obtaining experimental data, the challenge of coordinating experiments at national labs across a politically-divided world, and the data-privacy concerns of large commercial operators, a fundamentally different, data-efficient approach is desired. Here, we present a physics-guided machine learning (PGML) scheme to demonstrate that: (a) the world can be divided into a few PV-specific climate zones, called PVZones, illustrating that the relevant meteorological conditions are shared across continents; (b) by exploiting the climatic similarities, high-quality monthly energy yield data from as few as five locations can accurately predict yearly energy yield potential with high spatial resolution and a root mean square error of less than 8 kWhm$^{-2}$; and (c) even with noisy, heterogeneous public PV performance data, the global energy yield can be predicted with less than 6% relative error compared to physics-based simulations, provided that the dataset is representative. This PGML scheme is agnostic to PV technology and farm topology, making it adaptable to new PV technologies or farm configurations. The results encourage physics-guided, data-driven collaboration among national policymakers and research organizations to build efficient decision support systems for accelerated PV qualification and deployment across the world.
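
The two-stage idea, clustering locations into climate zones and then learning yield within each zone, can be sketched as below; the features, zone count, and per-zone regressor are illustrative assumptions, not the paper's PGML pipeline:

```python
# Zone-then-regress sketch with assumed features and a linear surrogate.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
met = rng.random((500, 4))            # stand-in meteorological features per site
yield_kwh = rng.random(500) * 200     # stand-in annual energy yield per site

zones = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(met)

# one simple surrogate per zone (a stand-in for the physics-guided model)
models = {z: LinearRegression().fit(met[zones == z], yield_kwh[zones == z])
          for z in np.unique(zones)}

def predict(site_features: np.ndarray, zone: int) -> float:
    return float(models[zone].predict(site_features.reshape(1, -1))[0])
```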


Lithium-Ion Battery System Health Monitoring and Fault Analysis from Field Data Using Gaussian Processes

Schaeffer, Joachim, Lenz, Eric, Gulla, Duncan, Bazant, Martin Z., Braatz, Richard D., Findeisen, Rolf

arXiv.org Artificial Intelligence

Health monitoring, fault analysis, and detection are critical for the safe and sustainable operation of battery systems. We apply Gaussian process resistance models on lithium iron phosphate battery field data to effectively separate the time-dependent and operating point-dependent resistance. The data set contains 29 battery systems returned to the manufacturer for warranty, each with eight cells in series, totaling 232 cells and 131 million data rows. We develop probabilistic fault detection rules using recursive spatiotemporal Gaussian processes. These processes can handle over a million data points quickly, enabling advanced online monitoring and furthering the understanding of battery pack failure in the field. The analysis underlines that often only a single cell shows abnormal behavior or a knee point, consistent with weakest-link failure for cells connected in series, amplified by local resistive heating. The results further the understanding of how batteries degrade and fail in the field and demonstrate the potential of efficient online monitoring based on data. We open-source the code and will publish the large data set upon completion of the review of this article.

Lithium-Ion Batteries (LIBs) are essential for Electric Vehicles (EVs), grid storage, mobile applications, and consumer electronics. Over the last 30 years, remarkable advances have led to long-lasting cells with high energy efficiency and density [1]. The growth of production volume over the last decade is projected to continue [2, 3], mainly due to EVs and stationary storage, both needed for the transition to a sustainable future.
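
Fitting resistance with a Gaussian process over time and operating point can be sketched with scikit-learn. The paper's recursive spatiotemporal formulation is not reproduced; since scikit-learn does not expose kernels on feature subsets, a single anisotropic kernel stands in for the additive time-plus-operating-point structure:

```python
# GP resistance model sketch with an anisotropic kernel (simplified stand-in).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.random(200) * 1000,   # time in days
    rng.random(200) * 40,     # temperature in degC (operating point)
    rng.random(200),          # state of charge (operating point)
])
# synthetic resistance: drifts with time, rises at low temperature
r = 1.0 + 1e-3 * X[:, 0] + 0.01 * (40 - X[:, 1]) + 0.05 * rng.standard_normal(200)

# separate length scale per input dimension + observation noise
kernel = RBF(length_scale=[100.0, 10.0, 0.2]) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, r)
mean, std = gp.predict(X[:5], return_std=True)    # posterior with uncertainty
```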


Comparing remote sensing-based forest biomass mapping approaches using new forest inventory plots in contrasting forests in northeastern and southwestern China

Dong, Wenquan, Mitchard, Edward T. A., Chen, Yuwei, Chen, Man, Cao, Congfeng, Hu, Peilun, Xu, Cong, Hancock, Steven

arXiv.org Artificial Intelligence

Large-scale, high spatial resolution aboveground biomass (AGB) maps play a crucial role in determining forest carbon stocks and how they are changing, which is instrumental in understanding the global carbon cycle and implementing policy to mitigate climate change. The advent of the new space-borne LiDAR sensor, NASA's GEDI instrument, provides unparalleled possibilities for the accurate and unbiased estimation of forest AGB at high resolution, particularly in dense and tall forests, where Synthetic Aperture Radar (SAR) and passive optical data exhibit saturation. However, GEDI is a sampling instrument, collecting dispersed footprints, and its data must be combined with that from other continuous-cover satellites to create high-resolution maps, using local machine learning methods. In this study, we developed local models to estimate forest AGB from GEDI L2A data, as the models used to create the GEDI L4 AGB data incorporated minimal field data from China. We then applied LightGBM and random forest regression to generate wall-to-wall AGB maps at 25 m resolution, using extensive GEDI footprints as well as Sentinel-1, ALOS-2 PALSAR-2, and Sentinel-2 optical data. Under 5-fold cross-validation, LightGBM performed slightly better than random forest across the two contrasting regions, and it ran substantially faster, requiring roughly one-third of the computation time on the same hardware. Validated against field data, the 25 m resolution AGB maps generated using the local models developed in this study exhibited higher accuracy than the GEDI L4B AGB data. In both regions, error increased with slope. The trained models were tested on nearby but different regions and exhibited good performance.
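
The headline comparison reduces to a standard cross-validated benchmark; a minimal sketch with random stand-ins for the GEDI / Sentinel-1 / PALSAR-2 / Sentinel-2 predictor stack:

```python
# Cross-validated LightGBM vs. random forest comparison (illustrative data).
import numpy as np
import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((1000, 12))            # per-pixel predictor stack (assumed width)
agb = rng.random(1000) * 300          # aboveground biomass, Mg/ha

for name, model in [
    ("random forest", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("LightGBM", lgb.LGBMRegressor(n_estimators=200, random_state=0)),
]:
    rmse = -cross_val_score(model, X, agb, cv=5,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE {rmse.mean():.1f} +/- {rmse.std():.1f} Mg/ha")
```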


Temporal assessment of malicious behaviors: application to turnout field data monitoring

Abdellaoui, Sara, Dumitrescu, Emil, Escudero, Cédric, Zamaï, Eric

arXiv.org Artificial Intelligence

Cyber-physical systems (CPS) such as railway turnouts have a distributed, communicating nature that makes them vulnerable to cyberattacks [2]. The security of CPS has emerged as a complex problem since the discovery of the Stuxnet malware [3], which targeted the Iranian industrial control system. This information was projected on the life cycle of the turnout, according to time-aging and operation-aging criteria, in order to compute a cyberthreat likelihood for each current curve observed. Maintenance operators use the estimated likelihood to assess the authenticity of each current curve.


Bathymetric Surveying with Imaging Sonar Using Neural Volume Rendering

Xie, Yiping, Troni, Giancarlo, Bore, Nils, Folkesson, John

arXiv.org Artificial Intelligence

This research addresses the challenge of estimating bathymetry from imaging sonars, where state-of-the-art works have primarily relied on either supervised learning with ground-truth labels or surface rendering based on the Lambertian assumption. In this letter, we propose a novel, self-supervised framework based on volume rendering for reconstructing bathymetry using forward-looking sonar (FLS) data collected during standard surveys. We represent the seafloor as a neural heightmap encapsulated with a parametric multi-resolution hash encoding scheme, and we model the sonar measurements with a differentiable sonar volume renderer that employs hierarchical sampling techniques. Additionally, we model the horizontal and vertical beam patterns and estimate them jointly with the bathymetry. We evaluate the proposed method quantitatively on simulation and field data collected by remotely operated vehicles (ROVs) during low-altitude surveys. Results show that the proposed method outperforms the current state-of-the-art approaches that use imaging sonars for seabed mapping. We also demonstrate that the proposed approach can potentially be used to increase the resolution of a low-resolution prior map with FLS data from low-altitude surveys.
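
At the heart of any volume renderer is the quadrature that composites densities sampled along a ray into weights. A minimal sketch of that step alone, with placeholder densities rather than the paper's sonar measurement model:

```python
# NeRF-style volume-rendering weights along one ray (generic sketch).
import torch

def render_weights(sigma: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """sigma, deltas: (n,) densities and segment lengths along a ray."""
    alpha = 1.0 - torch.exp(-sigma * deltas)           # per-segment opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)  # transmittance through i
    trans = torch.cat([torch.ones(1), trans[:-1]])     # transmittance before i
    return alpha * trans                               # compositing weights

sigma = torch.rand(64)             # densities, e.g. from a neural heightmap (assumed)
deltas = torch.full((64,), 0.05)   # uniform sample spacing (assumed)
w = render_weights(sigma, deltas)  # w sums to <= 1; depth = (w * ranges).sum()
```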


Psittacines of Innovation? Assessing the True Novelty of AI Creations

Mukherjee, Anirban

arXiv.org Artificial Intelligence

We examine whether Artificial Intelligence (AI) systems generate truly novel ideas rather than merely regurgitating patterns learned during training. Utilizing a novel experimental design, we task an AI with generating project titles for hypothetical crowdfunding campaigns. We compare AI-generated project titles with one another, measuring repetition and complexity. We compare the AI-generated titles with actual observed field data using an extension of maximum mean discrepancy--a metric derived from the application of kernel mean embeddings of statistical distributions to high-dimensional machine learning (large language) embedding vectors--yielding a structured analysis of AI output novelty. Results suggest that (1) the AI generates unique content even under increasing task complexity and at the limits of its computational capabilities, (2) the generated content has face validity, being consistent both with inputs to other generative AI and in qualitative comparison to field data, and (3) the content exhibits divergence from field data, mitigating concerns relating to intellectual property rights. We discuss implications for copyright and trademark law.
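
The base metric the authors extend, squared maximum mean discrepancy between two embedding samples, has a short unbiased estimator. A minimal sketch with an RBF kernel and an assumed bandwidth (the paper's extension is not reproduced):

```python
# Unbiased MMD^2 estimator between two embedding samples (RBF kernel).
import numpy as np

def rbf(a: np.ndarray, b: np.ndarray, gamma: float) -> np.ndarray:
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2_unbiased(x: np.ndarray, y: np.ndarray, gamma: float = 0.5) -> float:
    """x: (n, d) sample one; y: (m, d) sample two."""
    kxx, kyy, kxy = rbf(x, x, gamma), rbf(y, y, gamma), rbf(x, y, gamma)
    n, m = len(x), len(y)
    np.fill_diagonal(kxx, 0.0)                     # drop i == j terms (unbiased)
    np.fill_diagonal(kyy, 0.0)
    return (kxx.sum() / (n * (n - 1))
            + kyy.sum() / (m * (m - 1))
            - 2.0 * kxy.mean())

rng = np.random.default_rng(0)
ai_titles = rng.standard_normal((100, 16))         # stand-in title embeddings
field_titles = rng.standard_normal((120, 16))
print(mmd2_unbiased(ai_titles, field_titles))
```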